connectionist implementations of signal processing functions such as filtering, transforms, convolution and pattern classification and clustering. The topics which have been found to be the most promising, insofar as their novelty and applicability to the classification of pulses are concerned, are covered in the following four sections:

1.  Algebraic  Transforms  and  Classifiers 

2.  An  Interaction  Model  of  Neural-Net  based  Associative  Memories 

3.  Models  of  Adaptive  Neural-Net  based  Pattern  Classifiers 

4.  Discrete  Fourier  Transforms  on  Hypercubes 

NOTE: Rome Laboratory/RL (formerly Rome Air Development Center/RADC)


14. SUBJECT TERMS: An Interaction Model of Neural-Net based Associative Memories, Content Addressable Memories, Discrete Fourier Transforms

17. SECURITY CLASSIFICATION OF REPORT: UNCLASSIFIED
18. SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED
19. SECURITY CLASSIFICATION OF ABSTRACT: UNCLASSIFIED
20. LIMITATION OF ABSTRACT: UL

Advanced  Signal  Processing  Techniques 


Technical  Summary 


Our  research  efforts  have  been  in  the  area  of  parallel  and  connectionist 
implementations  of  signal  processing  functions  such  as  filtering,  transforms, 
convolution  and  pattern  classification  and  clustering. 

Much  more  than  is  reported  here  was  reviewed  and  entertained  for  further  study 
and  potential  implementation.  For  example,  we  considered  in  detail 
implementation  of  techniques  common  to  all  filtering  such  as  the  computation  of 
recursive  linear  equations  using  systolic  arrays  and  pipeline  and  vector  processors. 
Similarly,  various  neural  architectures  were  studied  and  implemented.  These 
include Hopfield, Hamming, Back Propagation and Adaptive Resonance Theory
Networks. 

We report here those topics which we found to be most promising insofar as their
novelty and applicability to the classification of pulses are concerned, and additionally
a new, more transparent method for the computation of the discrete Fourier transform on
hypercubes. The report consists of four sections:

Algebraic  Transforms  and  Classifiers  presents  a  two-layer  neural  net  classifier  with 
the  first  layer  mapping  from  feature  space  to  a  transform  or  code  space  and  the 
second  layer  performing  the  decoding  to  identify  the  required  class  label.  This 
research strongly suggests the further investigation of new types of neural net
classifiers  whose  kernels  can  be  identified  with  linear  and  nonlinear  algebraic 
binary  codes. 

An Interaction Model of Neural-Net based Associative Memories presents a new
model  for  Content  Addressable  Memories  which  takes  into  account  pattern 
interactions  generated  dynamically  over  time.  It  is  hoped  that  this  proposed 
scheme  will  yield  low  error  rates. 

Models  of  Adaptive  Neural-net  based  Pattern  Classifiers  deals  with  design  and 
implementation  issues  at  the  operational  level  of  ART-type  parallel  distributed 
architectures  in  which  learning  is  asynchronously  mediated  via  a  set  of  concurrent 
group  nodes  comprising  a  number  of  elemental  processors  working  on  component 
substrings  of  the  input  pattern  strings.  A  single-slab  Bland-net  alternative 
architecture  is  also  proposed. 


Discrete Fourier Transforms on Hypercubes presents a decomposition of the
transform matrix that is easier to understand than others found in the literature
and a basic scheme for power-of-prime length transform computations on the
hypercube. For lengths that are not a power of a prime, the use of the Chinese
Remainder Theorem in pipelining the computation is proposed and discussed.


ALGEBRAIC  TRANSFORMS  AND  CLASSIFIERS 


A two-layer neural net classifier is proposed and studied. Codewords from a {1,-1}
algebraic code are 'wired' in the net to correspond to the class labels of the classification
problem studied. The first layer maps from feature space to some hypercube {1,-1}^n,
while the second layer performs algebraic decoding in {1,-1}^n to identify the codeword
corresponding to the required class label.

The  main  motivation  for  introducing  the  proposed  architecture  is  the  existence  of  a  large 
body  of  knowledge  on  algebraic  codes  that  has  not  been  used  much  in  classification 
problems.  It  is  expected  that  classifiers  with  low  error  rates  can  be  built  through  the  use  of 
algebraic  coding  and  decoding. 

1.  Neural  network  classifiers 

Artificial neural nets have become the subject of intense research for their potential
contribution to the area of pattern classification (3). The goals of neural net classifier
construction are manifold and include the building of classifiers that have lower error
rates than classical (i.e. Bayesian) classifiers, are capable of good generalization from
training data, and have only reasonable memory and training time requirements. Most neural
net classifiers proposed in the literature adjust their parameters using some training data.
In this proposal, training has been set aside to bring out possible benefits of algebraic
coding. In subsequent work, classifier parameter adjustment will be studied in conjunction
with the algebraic coding approach.

2.  Net  topology  and  operation 

The neural net consists of three layers of units connected in a feedforward manner
(see Figure 1).

The number of units in the input layer is the length k of each feature vector. The hidden
layer has as many units as the length of the algebraic code used. Finally the output layer
contains one unit for each class in the classification problem. The chosen codewords from
the algebraic code used are wired into the net via the interconnection weights between the
input and the hidden layers. The decoding properties of the algebraic code come into play
in the interconnection weights between the hidden and the output layers. The actual choice
for these weights will be detailed below. Note that even though the topology chosen here
resembles other popular neural net topologies, such as that of the back-propagation
network, this first version of our classifier is non-adaptive and non-learning (the pattern
classes are not learned but wired in from some a priori knowledge), and the time needed to
recognize a pattern is only the propagation delay through the two layers of weights.


Figure  1 
The network operates as follows. A feature vector f to be classified is presented to the
input layer; this feature vector is a corrupted version of some class representative r (the
latter associated with the codeword c). The second layer of units outputs a vector u, the
'frequency domain' transform of the feature vector f, which is a corrupted version of c. The
output layer decodes u into the codeword c, thereby identifying the class represented by r.
The output layer does this by having all units 'off' except the one corresponding to
codeword c, i.e. class r.

To make the net operate as required, we need to devise schemes for wiring codewords in
the interconnection weights so as to achieve a 'continuous' mapping from feature space to
code space (feature vectors that are close correspond to vectors in {1,-1}^n that are also
close; this is important if the error rate is to be kept low), decide the transformation
performed by each unit and find a way of making the decoding of a corrupted codeword
correspond to turning on exactly one unit in the output layer. Finally, we must analyze the
kind of algebraic codes that are suitable for the proposed architecture.

First, note that the input layer of units will simply buffer the feature vector to be
classified; hence the units here perform the identity transformation. Secondly, it should be
clear that the units in the hidden and output layers must be polar (i.e. they output 1 or -1;
actually we could have made them binary but it is easier to deal with polar outputs for
neural nets). We have decided to make them threshold units, outputting +1 if the net
input is at least the threshold B of the unit, -1 otherwise. The value of B will be determined
below. But first we need to make some elementary observations about {1,-1} codes.

3.  Remarks  on  {1,-1}  codes 

We remark that if c and c' are two {1,-1} vectors of the same length n, then

c . c' + 2 distance(c,c') = n

where the dot is the usual dot product and distance is the Hamming distance between the
two vectors. This is so because for {1,-1} vectors, c . c' is the number of coordinates where
the two vectors agree minus the number of coordinates where they disagree; hence c . c' is
n minus twice the Hamming distance of the two vectors.
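The identity above, and the minimum-distance formula 0.5 * (n - M) that follows from it, can be checked numerically. The following is a minimal sketch in Python; the three-codeword example code and the helper names (dot, hamming, max_dot, min_distance) are illustrative assumptions and do not come from the report.

```python
from itertools import combinations

def dot(c, d):
    """Ordinary dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(c, d))

def hamming(c, d):
    """Number of coordinates in which c and d differ."""
    return sum(1 for x, y in zip(c, d) if x != y)

def max_dot(code):
    """Maximum dot product M among distinct codewords."""
    return max(dot(c, d) for c, d in combinations(code, 2))

def min_distance(code):
    """Minimum Hamming distance among distinct codewords."""
    return min(hamming(c, d) for c, d in combinations(code, 2))

# Hypothetical {1,-1} code of length n = 4 (chosen only for illustration).
code = [(1, 1, 1, 1), (1, -1, 1, -1), (1, 1, -1, -1)]
n = 4

# c . c' + 2 distance(c, c') = n for every pair, including c = c'.
identity_holds = all(dot(c, d) + 2 * hamming(c, d) == n
                     for c in code for d in code)
M = max_dot(code)
d_min = min_distance(code)
```

For this example code, M is 0 and the minimum distance is 2, which agrees with 0.5 * (n - M).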

A simple consequence of this observation is that if C is a {1,-1} code of length n, i.e. a
collection of distinct {1,-1} vectors all of length n, and if M is the maximum dot product
among different codewords of C, then the minimum Hamming distance among different
vectors of C is

0.5 * (n - M)

and thus the code corrects any vector corrupted by fewer than

0.25 * (n - M)

errors, using the usual nearest neighbor decoding scheme. Because of this, we will be
interested in looking for {1,-1} codes where the length n is much larger than the maximum
dot product M.


Next, we remark that if c is the closest codeword in the code C to a {1,-1} vector u, and if
w is any other vector in C, then the following inequalities hold

w . u < 0.5 * (n + M) < c . u

This can be seen as follows. Let u ∈ {1,-1}^n and let c be the codeword in C closest to u
(note that at first sight there may be several codewords in C having minimum distance
from u; however, if u has not been corrupted by 0.25 * (n - M) or more errors, then this is
impossible; we will assume this from now on).

Now let u = c + e, where the error vector e has its q nonzero components equal to 2 or
-2. We have

distance(u,w) > distance(u,c), for all w ∈ C - {c}

Now

c . u = n - 2 distance(c,u) = n - 2q
Since it is assumed that fewer than 0.25 * (n - M) errors have been made in corrupting c to
get u, we have

q  <  0.25  *  (  n  -  M  ) 


Hence 

c . u = n - 2q > n - 0.5 * (n - M) = 0.5 * (n + M)

Also, 

w . u = w . (c + e)

= w . c + w . e

≤ M + 2q, because w . c ≤ M and the q nonzero components of e are 2 or -2

< M + 0.5 * (n - M) = 0.5 * (n + M)

Hence  the  claim  is  proved. 

Note that the last observation will be used to decide the threshold for the units in the
output layer, and the interconnection weights between the hidden and output layers.


4.  Weight  matrix  from  hidden  to  output  layers 

Choose a suitable code C (suitable in the sense of having the length n much larger than
the maximum dot product M) and select m codewords

c_1, c_2, ... , c_m

from the code C. Writing the components of c_j as

c_j1, c_j2, ... , c_jn

use the following matrix as the weight matrix from the hidden to output layers

    C_mxn = | c_11  c_12  ...  c_1n |   | c_1 |
            | c_21  c_22  ...  c_2n | = | c_2 |
            | ...                   |   | ... |
            | c_m1  c_m2  ...  c_mn |   | c_m |
In other words, the weights from the hidden layer to the output layer are all 1 or -1, and
the weights from the n hidden units to the jth output unit are the components of the
codeword c_j (see Figure 1).

Now if the output of the hidden layer is the vector u, then the net input to the jth output
unit is

c_j . u

Hence if the threshold of each unit in the output layer is chosen to be

0.5 * (n + M)

then by the observation in section 3, only the unit corresponding to the codeword closest
to u will output 1, the rest -1.
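The decoding performed by the output layer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four codewords are rows of a Hadamard matrix of order 8 picked for the example, and the function name output_layer is hypothetical.

```python
def output_layer(u, codewords, M):
    """Threshold units: unit j outputs +1 if c_j . u >= 0.5 * (n + M), else -1."""
    n = len(u)
    theta = 0.5 * (n + M)
    outputs = []
    for c in codewords:
        net = sum(ci * ui for ci, ui in zip(c, u))  # net input c_j . u
        outputs.append(1 if net >= theta else -1)
    return outputs

# Four mutually orthogonal {1,-1} codewords of length 8, so M = 0 (example choice).
codewords = [
    (1,  1,  1,  1, 1,  1,  1,  1),
    (1, -1,  1, -1, 1, -1,  1, -1),
    (1,  1, -1, -1, 1,  1, -1, -1),
    (1, -1, -1,  1, 1, -1, -1,  1),
]

# Third codeword with its first coordinate flipped: one error, fewer than 0.25 * 8 = 2.
u = (-1, 1, -1, -1, 1, 1, -1, -1)
decoded = output_layer(u, codewords, M=0)
```

With one error, the net input of the third unit is 6 while the others receive -2, so only the third unit exceeds the threshold 4.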


5.  Weight  matrix  from  input  to  hidden  layers 

The thresholds of the units of the hidden layer are chosen to be 0. Now let W_kxn be the
weight matrix between the input and the hidden layers. Let R_mxk be the matrix whose
rows are the class labels in the classification problem at hand.

    R_mxk = | r_11  r_12  ...  r_1k |   | r_1 |
            | r_21  r_22  ...  r_2k | = | r_2 |
            | ...                   |   | ... |
            | r_m1  r_m2  ...  r_mk |   | r_m |

Several schemes can be imagined for computing the matrix W_kxn from the matrices R_mxk
and C_mxn.


The first scheme is a simple correlation, defined by

W_kxn = (R_mxk)^T . C_mxn

where (R_mxk)^T denotes the transpose of the matrix R_mxk. In other words, the weight from
the ith input unit to the jth hidden unit is the sum of the correlations between the ith
component of a class label and the jth component of the corresponding codeword (see also
Figure 1):

w_ij = sum over l = 1, ..., m of r_li c_lj

The  idea  here  is  a  natural  one.  In  order  to  achieve  continuity,  we  seek  to  strengthen  the 
connection  between  a  class  label  and  its  corresponding  codeword. 
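The correlation scheme, together with the zero-threshold hidden layer, can be sketched as follows. This is a minimal Python illustration under assumptions: the two class labels, the two codewords and the noisy feature vectors are invented for the example, and the function names are hypothetical.

```python
def correlation_weights(R, C):
    """W = R^T C, i.e. w_ij = sum over l of r_li * c_lj (R is m x k, C is m x n)."""
    m, k, n = len(R), len(R[0]), len(C[0])
    return [[sum(R[l][i] * C[l][j] for l in range(m)) for j in range(n)]
            for i in range(k)]

def hidden_layer(f, W):
    """Polar threshold units with threshold 0 applied to the net input f W."""
    k, n = len(W), len(W[0])
    net = [sum(f[i] * W[i][j] for i in range(k)) for j in range(n)]
    return [1 if x >= 0 else -1 for x in net]

# Hypothetical class labels (rows of R) and their codewords (rows of C).
R = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]
C = [[1,  1, 1,  1],
     [1, -1, 1, -1]]

W = correlation_weights(R, C)

# Noisy versions of the two class labels.
u1 = hidden_layer([0.9, 0.1, 1.1, -0.2], W)
u2 = hidden_layer([0.0, 1.2, -0.1, 0.8], W)
```

In this example the class labels are orthogonal, so each noisy feature vector is mapped to the codeword of its class; with correlated class labels the hidden layer may introduce errors that the output layer then has to correct.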

Another scheme 'simulates' a linear transformation from feature space to code space. If the
hypercube {1,-1}^n were a vector space, which it is not, and if the class labels, i.e. the rows
of the matrix R_mxk, were linearly independent, then there would be a linear
transformation mapping class label r_i onto codeword c_i, for all i with 1 ≤ i ≤ m. In other
words, there would be a kxn matrix W with

R_mxk . W = C_mxn

If R were square, it would be nonsingular and we could write

W = (R_mxk)^-1 . C_mxn

which would give the interconnection weights we are looking for (recall that a linear
transformation between finite-dimensional vector spaces is always continuous).

Of course, the above assumptions do not hold in most cases. We may though use some
kind of approximation of the inverse for R, called the pseudoinverse of R and denoted R^+
(we have dropped the index mxk for simplicity).


If the matrix R is of maximum rank, but not square, then R . R^T is nonsingular and the
pseudoinverse R^+ is defined by

R^+ = R^T . (R . R^T)^-1

If R . R^T is not invertible, then R^+ is given by a limit

R^+ = lim (ε → 0) R^T . (R . R^T + ε I)^-1

where I denotes the identity matrix of the size of R . R^T and ε is a positive real number.

It can be shown that the pseudoinverse R^+ always exists. Of course, in practice, we only
compute an approximation of the limit most of the time, since the exact form of R^+ may
be difficult to obtain. For a nonsingular matrix, the ordinary inverse and the
pseudoinverse are of course identical.

In any case, we can take as weight matrix from the input layer to the hidden layer the
product

W_kxn = R^+ . C_mxn
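For the full-rank case, the pseudoinverse scheme can be sketched with exact rational arithmetic. The Python below is an illustration only: it covers just the case where R . R^T is invertible, uses a plain Gauss-Jordan inverse, and the matrices R and C are made-up examples rather than data from the report.

```python
from fractions import Fraction

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse(A):
    """Gauss-Jordan inverse over the rationals; raises StopIteration if A is singular."""
    n = len(A)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(A)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def pseudoinverse_full_row_rank(R):
    """R^+ = R^T (R R^T)^-1, valid when the rows of R are linearly independent."""
    Rt = transpose(R)
    return matmul(Rt, inverse(matmul(R, Rt)))

# Hypothetical class labels (m = 2, k = 3) and codewords (n = 4).
R = [[1, 2, 0],
     [0, 1, 1]]
C = [[1,  1, 1,  1],
     [1, -1, 1, -1]]

W = matmul(pseudoinverse_full_row_rank(R), C)  # k x n weight matrix
RW = matmul(R, W)                              # equals C when R has full row rank
```

Because R . R^+ is the identity when R has full row rank, each class label is mapped exactly onto its codeword before thresholding.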


6.  Hadamard  coding 

It is believed that the error rate of the proposed classifier depends heavily on the number
of errors that the algebraic code wired into the net can correct, which, as we have seen, is
a linear function of (n - M), where n is the length of the code and M is the maximum dot
product. We then need to investigate {1,-1} codes for which n is significantly larger than
M.


The class of Hadamard codes seems to provide a reasonable starting point. Recall that
Hadamard matrices constitute a handy tool in many disciplines, including signal processing
(Hadamard matrices are the kernel of the so-called Hadamard-Walsh transform) and
design of experiments (Hadamard matrices give rise to good symmetric combinatorial
designs). Even though Hadamard matrices exist only for certain orders, we believe that
they are still quite useful for the classifier architecture proposed.
A Hadamard matrix of order n is a square matrix H of order n, whose entries are 1 or -1,
and such that

H . H^T = n I

where I is the identity matrix of order n.

The definition really says that the dot product of any two distinct rows of H is 0; hence if
we take as codewords all the rows of H, then M = 0, i.e. the code can tolerate any number
of errors less than 0.25n. Of course the threshold of each output unit of the classifier must
be set to 0.5n.

Examples of low-order Hadamard matrices are shown in Figure 2, where -1 is simply
written -.


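    n = 1:  H1 = | 1 |

    n = 2:  H2 = | 1 1 |
                 | 1 - |

    n = 4:  H4 = | 1 1 1 1 |
                 | 1 - 1 - |
                 | 1 1 - - |
                 | 1 - - 1 |

    n = 8:  H8 = | 1 1 1 1 1 1 1 1 |
                 | 1 - 1 - 1 - 1 - |
                 | 1 1 - - 1 1 - - |
                 | 1 - - 1 1 - - 1 |
                 | 1 1 1 1 - - - - |
                 | 1 - 1 - - 1 - 1 |
                 | 1 1 - - - - 1 1 |
                 | 1 - - 1 - 1 1 - |

Figure 2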

It can easily be shown that a Hadamard matrix of order n can exist only if n = 1, 2 or
n ≡ 0 mod 4. It has not yet been proved that a Hadamard matrix of order n exists whenever n
is a multiple of 4, even though this is widely believed.
For our purposes it is enough to note that Hadamard matrices do exist for orders that are
a power of 2. This is seen through the fact that if H_n is a Hadamard matrix of order n,
then H_2n is a Hadamard matrix of order 2n, where

    H_2n = | H_n   H_n |
           | H_n  -H_n |
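The doubling construction can be sketched directly. The Python below is a minimal illustration; the function names sylvester and is_hadamard are assumptions made for this example.

```python
def sylvester(k):
    """Hadamard matrix of order 2**k built by repeated doubling [[H, H], [H, -H]]."""
    H = [[1]]
    for _ in range(k):
        top = [row + row for row in H]
        bottom = [row + [-x for x in row] for row in H]
        H = top + bottom
    return H

def is_hadamard(H):
    """Check H . H^T = n I: rows have squared length n and are mutually orthogonal."""
    n = len(H)
    for i in range(n):
        for j in range(n):
            d = sum(a * b for a, b in zip(H[i], H[j]))
            if d != (n if i == j else 0):
                return False
    return True

H8 = sylvester(3)
ok = is_hadamard(H8)
```

Taking the rows of such a matrix as codewords gives a code with M = 0, so the output threshold is 0.5n as stated above.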


7.  Simplex  coding 

Recall the construction of the finite field GF(p^k) with p^k elements and prime subfield
GF(p) = Z_p, where p is a prime number (p = 2 is the most important case for our
classifier). An irreducible polynomial h(x) with coefficients in GF(p) and of degree k is
chosen. The elements of GF(p^k) are taken to be all polynomials with coefficients in GF(p)
and of degree less than k. Addition and multiplication are ordinary addition and
multiplication of polynomials but performed modulo h(x). It can easily be proved that
indeed GF(p^k) satisfies all axioms of a field and that it contains p^k elements.

It is well-known that the multiplicative group (GF(p^k) - {0}, *) of GF(p^k) is cyclic of
order p^k - 1. If x, as an element of the field, is a generator of this cyclic group, then h(x) is
called a primitive polynomial over GF(p). Note that databases of primitive polynomials
over GF(p) are available for many values of p.


Let

h(x) = x^k + h_{k-1} x^{k-1} + ... + h_1 x + h_0

be a primitive polynomial of degree k over GF(2). So h_i = 0 or 1 for all i and the
irreducibility of h(x) implies that h_0 = 1.

We define the pseudo-noise sequence generated by h(x) and given bits a_0, a_1, ..., a_{k-1} as
the binary sequence

a_0, a_1, ..., a_{k-1}, a_k, a_{k+1}, ...

where, for l ≥ k,

a_l = a_{l-k} + a_{l-k+1} h_1 + a_{l-k+2} h_2 + ... + a_{l-1} h_{k-1}

(addition and multiplication are over GF(2), i.e. modulo 2).
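The recurrence can be sketched as a small shift-register routine. This Python is an illustration only; the function name pn_sequence and the convention of passing the coefficients as the list [h_0, h_1, ..., h_{k-1}] are assumptions made for the example.

```python
def pn_sequence(h, seed, length):
    """Binary sequence a_l = sum_i a_{l-k+i} * h_i (mod 2), for l >= k.

    h    : low-order coefficients [h_0, ..., h_{k-1}] of the monic polynomial h(x)
    seed : initial bits [a_0, ..., a_{k-1}]
    """
    k = len(h)
    a = list(seed)
    while len(a) < length:
        l = len(a)
        a.append(sum(a[l - k + i] * h[i] for i in range(k)) % 2)
    return a[:length]

# h(x) = x^3 + x + 1 -> h_0 = 1, h_1 = 1, h_2 = 0; seed a_0 = 0, a_1 = 0, a_2 = 1.
seq7 = pn_sequence([1, 1, 0], [0, 0, 1], 7)

# h(x) = x^4 + x + 1 -> h_0 = 1, h_1 = 1, h_2 = 0, h_3 = 0; seed 0, 0, 0, 1.
seq15 = pn_sequence([1, 1, 0, 0], [0, 0, 0, 1], 15)
```

Since h_0 = 1 for an irreducible polynomial, the term a_{l-k} h_0 in the code reduces to a_{l-k}, as in the formula above.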

The following properties are well-known (4):

(i) Let C = (a_l, a_{l+1}, ..., a_{l+n-1}) be a subsequence of length n of a pseudo-noise
sequence (a_i) generated by h(x), where

n = 2^k - 1

Then for any cyclic shift C' of C distinct from C, C + C' is another cyclic shift of C.

(ii) If C and n are as in (i) then C has 2^{k-1} ones and 2^{k-1} - 1 zeros.

Now from a pseudo-noise sequence generated by h(x), a binary code of length n = 2^k - 1,
denoted C'', is constructed by taking as codewords the vector

(a_0, a_1, ..., a_{n-1})

and all of its 2^k - 2 cyclic shifts.

The code C'' together with the all-zero vector of length 2^k - 1 is also known in the
literature as the cyclic simplex code of length 2^k - 1 and dimension k. By property (i)
above, the simplex code is a vector subspace of dimension k of GF(2)^n, where n = 2^k - 1.

From C'' a {1,-1} code C is constructed, through the affine mapping a → 1 - 2a (i.e. replace
0 by 1 and 1 by -1).

For u, w ∈ C, we can compute u . w as follows.

If u = w then clearly u . w = n = 2^k - 1.

If u ≠ w then u . w = n - 2 distance(u,w) as explained in section 3. But distance(u,w) is the
Hamming distance between the corresponding vectors c_1 and c_2 in C'', which by linearity of
the simplex code C'' ∪ {0}, equals

distance(0, c_2 - c_1)

= distance(0, c_2 + c_1)

= weight(c_2 + c_1)

= 2^{k-1} by property (ii) above.

So,

u . w = n - 2 distance(u,w)

= 2^k - 1 - 2 . 2^{k-1} = -1

Hence for the code C, the maximum dot product is M = -1. C can tolerate any number
of errors less than 0.25 * (n + 1) = 2^{k-2}. By section 4, the threshold of each output unit
of the classifier must be set to 0.5 * (n + M) = 0.5 * (n - 1) = 2^{k-1} - 1.

We give a couple of examples to illustrate the construction.

For h(x) = x^3 + x + 1 (so h_1 = 1 and h_2 = 0), and a_0 = 0, a_1 = 0 and a_2 = 1, we get

a_3 = a_0 + a_1 . h_1 + a_2 . h_2 = 0

a_4 = a_1 + a_2 . h_1 + a_3 . h_2 = 1

a_5 = a_2 + a_3 . h_1 + a_4 . h_2 = 1

a_6 = a_3 + a_4 . h_1 + a_5 . h_2 = 1

The codes C'' and C are

    C'' = | 0 0 1 0 1 1 1 |        C = | 1 1 - 1 - - - |
          | 1 0 0 1 0 1 1 |            | - 1 1 - 1 - - |
          | 1 1 0 0 1 0 1 |            | - - 1 1 - 1 - |
          | 1 1 1 0 0 1 0 |            | - - - 1 1 - 1 |
          | 0 1 1 1 0 0 1 |            | 1 - - - 1 1 - |
          | 1 0 1 1 1 0 0 |            | - 1 - - - 1 1 |
          | 0 1 0 1 1 1 0 |            | 1 - 1 - - - 1 |

for which we can see that the minimum distance is indeed 4.
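The properties claimed for this example (all pairwise dot products equal to -1, minimum distance 4) can be checked mechanically. The Python below is a minimal sketch; the helper names are assumptions, and the base vector is the sequence 0010111 from the example.

```python
from itertools import combinations

def cyclic_shifts(v):
    """All n right cyclic shifts of v, starting with v itself."""
    n = len(v)
    return [v[n - s:] + v[:n - s] for s in range(n)]

def to_polar(v):
    """Affine map a -> 1 - 2a: 0 becomes 1 and 1 becomes -1."""
    return [1 - 2 * a for a in v]

binary_code = cyclic_shifts([0, 0, 1, 0, 1, 1, 1])   # the binary code of length 7
polar_code = [to_polar(v) for v in binary_code]      # the {1,-1} code

dots = {sum(x * y for x, y in zip(u, w))
        for u, w in combinations(polar_code, 2)}
distances = {sum(1 for x, y in zip(u, w) if x != y)
             for u, w in combinations(binary_code, 2)}
```

Here dots should contain only -1 and distances only 4, so M = -1 and the code tolerates fewer than 0.25 * (7 + 1) = 2 errors.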
Another example is provided by h(x) = x^4 + x + 1, a_0 = a_1 = a_2 = 0 and a_3 = 1. We
compute

a_4 a_5 a_6 a_7 a_8 a_9 a_10 a_11 a_12 a_13 a_14 = 00110101111

The code C'' is shown below, from which the {1,-1} code C is easily obtained.


000100110101111 

100010011010111 

110001001101011 

111000100110101 

111100010011010 

011110001001101 

101111000100110 

010111100010011 

101011110001001 

110101111000100 

011010111100010 

001101011110001 

100110101111000 

010011010111100 

001001101011110 


8.  Future  work 

We need to analyze more classes of algebraic codes suitable for the proposed architecture.
One of the promising directions is the area of codes from block designs (as mentioned
earlier, the Hadamard codes actually fall under this area). Properties of these codes are
often distilled from the geometric properties of the designs. Work in this area is in
progress.

Next we should actually attempt an analytical study of the error rate of the proposed
classifier. Note that the mapping from feature space to code space may actually introduce
more errors. Even if this mapping does not introduce any error, it may still be difficult to
derive a closed form of the error rate, so bounds may be the best we can do.

Finally we need to design an adaptive version of the proposed architecture, where the code
is somehow wired in the net but an improvement of the estimates of the parameters of the
classifier is also accomplished through training data.
An Interaction Model of
Neural-Net based Associative Memories


1.  Introduction 

One of the most intensively studied topics in the entire neural network paradigm, and one
that continues to engage major research effort, is the topic of Associative or
Content-Addressable Memory (CAM) in the context of a Connectionist framework (9). An
Associative Memory (AM) or a Content-Addressable Memory is a memory unit that maps an
input pattern or a feature vector x_k ∈ R^n to an associated output vector y_k ∈ R^m in O(1)
time. The input pattern x_k may be a substring of the desired output pattern y_k, or may be
totally distinct; however, in both cases, we identify an association of the form (x_k, y_k) for
each pattern index k. An AM or CAM is a storage of all such association pairs over a
given pattern domain. The input string is also known as the cue-vector, or a stimulus; the
output string is the associated response pattern.

Viewed from a neural network perspective, the ensemble, even if conceptualized as
a pattern classifier/recognizer in a strict functional sense, is essentially an AM/CAM entity,
if it works correctly. Because a neural-net based classifier/recognizer ultimately maps
an external object, via an appropriate feature space, onto distinct classes, we could
say that in a functional context (object → input pattern (stimulus), pattern-class)
comprise associative pairs. In this section we present a new model of Associative Memories
or Content-Addressable Memories realizable, say, via an appropriate optical implementation
of an artificial neural net type system using lenses, masks, gratings, etc. (4,6). The issue
addressed here is not this realization task per se, but the more important issue of models
basically congruent to neural net type systems from the point of view of functionality and
performance, the models which reflect distributivity, robustness, fault-tolerance, etc. The
implementation issue has to be addressed eventually; one could aim for a neural net
system in an electronic (as opposed to optical) environment, but that as a topic must wait
till one clears out the model related issues, the overall architectural framework of achieving
a certain set of objectives at the topmost level of specification.

In this paper, a matrix model of Associative Memory is proposed. The strength of the
model lies in the way it is different from the traditional AM models articulated thus far
(see Hopfield (7,11), Kohonen (8,9) etc. for Associative Memory models). This model
becomes significant at high storage density of patterns, when the logically neighboring
patterns do tend to perturb each other considerably. A memory system even locally stable in
the sense of Lyapunov at or around specific logical sites may tend to display erratic behavior
upon receiving another pattern for storage if the pattern storage density increases (1,2,3).
In such cases, instead of trying to keep patterns away from interacting with each other,
one could deliberately allow them to interact and then consider all such collective
interactions in the overall dynamics of the system. To date, no such model has been proposed.

Our model is different from the proposed classical memory models in one significant
sense. It allows interactions among patterns to perturb the associated feature space. As we
dynamically pack more and more patterns into the system, the 'neighborhood' of a pattern
becomes more skewed. It is analogous to the situation where an electric field around
a charged element is perturbed by bringing in more charged particles in the vicinity. It is
this recognition in our model that provides a changed perspective.

We observe at the outset the two salient features associated with neural net type
processors or systems using such processors. First, a neural network model is ultimately a
content-addressable memory (CAM) or even an associative memory (AM) depending
upon how it is implemented. The central theme in such models is storing of information of
the relevant patterns distributively in a holistic fashion over the synaptic weights between
pairs of neuron-like processor elements, allowing both storage and retrieval tasks more or
less in a fail-soft manner. Even though the time-constant of a biological neuron's switching
time is of the order of milliseconds, storage and recall of complex patterns are usually
carried out speedily in a parallel distributed processing mode. It is ultimately this notion of
distributed memory in contrast to local memory of conventional computers that stands out
insofar as neural net type systems are concerned.

A neural network system could also be viewed at the same time as a pattern
classifier/recognizer. Problems like radar target classification and object identification from
received sonar signals are where neural net type systems are most adept. In particular, if a
recognition of a partially viewed object or a partially described object has to be made, and if
the data is further contaminated with white noise, then not only does a neural-based classifier
appear to be more promising for real-time classification from the point of view of
performance, but a CAM or an AM oriented system might just be ideal. It depends, of course, on
how we ultimately structure the problem domain, what type of cues are allowed for
appropriate responses, etc., but it is eminently possible to lean on a specific CAM/AM
oriented neural-net paradigm to do just that. In this section, a generalized outer product
model is proposed.

The associative-memory elements ought to yield a performance close to O(1) search
time. However, once pattern-pattern interactions are allowed within the model one cannot
guarantee O(1) performance. In our model, it is feasible that on some cue patterns the
problem is so straightforward that one need not consider pattern-pattern interaction at
all. In such cases, it behaves as a classical AM element with O(1) search time.

Consider an orthonormal basis $\{\phi_k\}$ spanning an L-dimensional Euclidean vector space in which associated pairs of patterns (cue, response) in the form $(x_k, y_k)$ are stored in a correlation-matrix format of association. The whole idea is to buffer the stored patterns from the noisy cue vectors via this basis, such that one could see the logical association between x and y via a third entity, our basis. Note that a cue vector $x_k$ in this basis is expressed as

$$|x_k\rangle = \sum_i \alpha_{ki}\,|\phi_i\rangle$$

with its transpose as

$$\langle x_k| = \sum_i \alpha_{ki}\,\langle\phi_i|$$

such that the dot product between two such vectors $|x_k\rangle$ and $|x_l\rangle$ is

$$\langle x_k|x_l\rangle = \sum_{ij} \alpha_{ki}\,\alpha_{lj}\,\delta_{ij} = \sum_i \alpha_{ki}\,\alpha_{li} = a_{kl}$$

The squared Euclidean norm of a vector is then

$$\langle x_k|x_k\rangle = \sum_i (\alpha_{ki})^2 = a_{kk}$$

Note that the cue-vectors themselves, in this model, need not constitute an orthogonal vector space. Even though this is a desirable assumption to consider, in reality it may not be a good assumption to depend on. Secondly, the vector space $\{\phi_k\}$ need not be complete; an appropriate subspace would do, as indicated by Oja and Kohonen (13). They build the pattern space over some p-dimensional subspace with p < L and deal with a pattern by taking its projection on this subspace and extrapolating its residual as an orthogonal projection on this space. We could also extend our model in this way.

We could interpret the coefficients $\alpha_{ki}$ associated with a cue-feature as follows. The normalized coefficient $\alpha_{ki} / \sqrt{\sum_j \alpha_{kj}^2}$ could be regarded in this model as the probability amplitude that the vector $|x_k\rangle$ has a nonzero projection on the elemental basis vector $|\phi_i\rangle$. The squares of such terms are the associated probabilities; if we want to picture it, a cue vector in this representation would be a sequence of spikes on the $\phi$-domain whose height at any specific $\phi$-point is the probability amplitude that the vector has a nonzero projection there. Note that the sum of the squares of all these normalized projections must necessarily equal 1.0 for a nonzero cue-vector.

Given (cue, response) pairs we form matrix associative memory operators as follows:

$$M_s = \sum_k |\phi_k\rangle\langle x_k|$$

$$M_r = \sum_k |y_k\rangle\langle\phi_k|$$

The operator $M_s$ on a cue $|x_l\rangle$ is

$$M_s|x_l\rangle = \sum_k |\phi_k\rangle\langle x_k|x_l\rangle = a_{ll}\,|\phi_l\rangle + \sum_{k \neq l} a_{kl}\,|\phi_k\rangle$$

such that its projection on any arbitrary $|\phi_q\rangle$ is

$$\langle\phi_q|M_s|x_l\rangle = a_{ll}\,\delta_{ql} + a_{ql}\,(1 - \delta_{ql})$$

We call this the figure of merit of a cue $x_l$ on a basis vector $\phi_q$, written $fm(x_l|\phi_q)$.

Obviously, the memory operator has the largest projection when q = l, i.e.

$$\mathrm{index}\left\{\max_q\,\{fm(x_l|\phi_q)\}\right\} = l$$

Note that if $\max\{fm(x_l|\phi_s),\ s \neq q\} \approx a_{ll}$, then the cue $x_l$ may not be discernible from another cue with which an altogether different response is associated. We denote the associated domain of certainty by a resolution measure $res(y_q)$, where

$$res(y_q) = res(y_q|\phi_q) = 1 - \frac{1}{a_{ll}}\,\max_{s \neq q}\,[fm(x_l|\phi_s)]$$

In the simplest case, one assumes res(.) on associated pattern vectors to be very large. Or, equivalently, one assumes that the storage of multiple (cue, response) pairs is facilitated by the requirement that the Hamming distance between any pair of such cues in the library is relatively large. In such cases, after receiving the index value l associated with the given cue $|x_l\rangle$, one continues by operating $M_r$ on $|\phi_l\rangle$. Then

$$M_r|\phi_l\rangle = \sum_k |y_k\rangle\langle\phi_k|\phi_l\rangle = |y_l\rangle$$

indicating successful retrieval of the associated pattern $|y_l\rangle$.
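
The storage and retrieval steps of this one-point model can be sketched numerically. The following is only a minimal illustration, assuming NumPy, at least two stored pairs, and the standard basis of a K-dimensional index space as the basis (K being the number of stored pairs). The function names `build_memories` and `recall` are hypothetical, and for an unstored cue the resolution is normalized by the largest figure of merit rather than by the stored norm.

```python
import numpy as np

def build_memories(cues, responses):
    """Form the storage operator (sum_k |phi_k><x_k|) and the
    response operator (sum_k |y_k><phi_k|).

    Assumption: phi_k is the k-th standard basis vector, so the storage
    operator has the cues as rows and the response operator has the
    responses as columns.
    """
    m_s = np.asarray(cues, dtype=float)            # row k is <x_k|
    m_r = np.asarray(responses, dtype=float).T     # column k is |y_k>
    return m_s, m_r

def recall(m_s, m_r, cue):
    """Return (index, response, resolution) for a cue vector."""
    fm = m_s @ np.asarray(cue, dtype=float)        # fm[q] = <phi_q|M_s|x>
    q = int(np.argmax(fm))                         # index with the largest figure of merit
    others = np.delete(fm, q)                      # figures of merit of the competitors
    res = 1.0 - others.max() / fm[q]               # resolution of the winning association
    return q, m_r[:, q], res                       # M_r|phi_q> = |y_q>
```

With cues `[[1,0,0,0],[0,1,1,0],[0,0,1,1]]`, presenting the second cue selects index 1; its resolution is reduced because the third cue overlaps it.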

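Note that, in general, a received stimulus may not appear in the form one desires. It could be a noise-contaminated, incomplete cue-function of the form

$$|x\rangle = \gamma\,|x_l\rangle + |\delta\rangle, \quad \text{where } \langle x_l|\delta\rangle = 0$$

and $0 < \gamma < 1$. Then a memory operation on x is

$$M_s|x\rangle = \gamma\,M_s|x_l\rangle + M_s|\delta\rangle = \gamma\,a_{ll}\,|\phi_l\rangle + \gamma \sum_{k \neq l} a_{kl}\,|\phi_k\rangle + \sum_k |\phi_k\rangle\langle x_k|\delta\rangle$$

Now, given $|\delta\rangle = \sum_{i \neq l} \nu_i\,|\phi_i\rangle$, we obtain

$$\langle x_k|\delta\rangle = \sum_{i \neq l} \nu_i\,\alpha_{ki}$$

and the matrix elements become

$$\langle\phi_m|M_s|x\rangle = \gamma\,a_{ml} + \sum_{i \neq l} \nu_i\,\alpha_{mi}$$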


The matrix elements show deterioration of performance. Since the cue-vectors need not, in general, form an orthonormal basis, the first term on the right hand side of the matrix element would, in general, contain the usual cross-talk component, but now, in addition to this, the distortion $\delta$, even though orthogonal to the most likely cue-vector $x_l$, would introduce additional degradation.
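
The split of a noisy recall into a cross-talk part and a distortion part can be checked numerically. The sketch below assumes NumPy, the standard basis as the buffering basis, and hypothetical cue values; the distortion is made orthogonal to the stored cue by projection, which is one way (not necessarily the authors' way) of meeting the orthogonality condition stated above.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable

# Hypothetical stored cues (rows); they are deliberately not orthogonal.
cues = np.array([[1.0, 0.2, 0.0, 0.0],
                 [0.1, 1.0, 0.3, 0.0],
                 [0.0, 0.2, 1.0, 0.1]])
a = cues @ cues.T                # a[k, l] = <x_k|x_l>

l, gamma = 1, 0.8                # index of the underlying cue and its attenuation
x_l = cues[l]
noise = rng.normal(size=cues.shape[1])
delta = noise - (noise @ x_l) / (x_l @ x_l) * x_l   # enforce <x_l|delta> = 0

x = gamma * x_l + delta          # received cue: attenuated stored cue plus distortion
fm = cues @ x                    # fm[m] = <phi_m|M_s|x>
crosstalk = gamma * a[:, l]      # gamma * a_ml, including the signal term at m = l
distortion = cues @ delta        # <x_m|delta>, zero at m = l by construction
```

Here `fm` equals `crosstalk + distortion` entry by entry, so the extra degradation caused by the distortion shows up only on the off-target basis vectors.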

The question of optimization of the memory space is now the issue. Given the possibility of receiving distorted signals at the input ports from which one obtains the cue-vector in question, one may approach the design problem in a number of ways.

One simple approach, particularly when storage density is sparse, is to consider the possibility that the received cue x is close to one of the cues considered acceptable. Under this assumption, we form both the on-diagonal and the off-diagonal terms $\langle\phi_q|M_s|x\rangle$ and note the index q for which the matrix element is largest. We then logically associate x with $y_q$, on the assumption that $y_q$ is what is stored logically at the basis $\phi_q$. The task of retrieving the associated pattern given a cue is then a straightforward problem as long as a relatively sparse set of (cue, response) pairs is associatively stored. The problem occurs when cues are close to each other or when one finds that a single-level discriminant may not suffice. If an unknown cue x is not stored precisely in the format in which it has been received, and if it is equally close to $x_q$ as it is to $x_r$ with $q \neq r$, we must have further information to break the impasse.

One simple strategy is to extend the proposed abstract AM model as follows. In this case, we do not store just single-instance (cue, response) items but, instead, a class of items. In other words, we prepare a class-associative storage (or a class-content addressable storage) in the form below

$$\{(x_{1k}, y_k),\ (x_{2k}, y_k),\ \ldots,\ (x_{jk}, y_k),\ \ldots\}$$

$$M_{cs} = \sum_{ik} |\phi_k\rangle\langle x_{ik}|, \quad \text{and}$$

$$M_{cr} = \sum_k |y_k\rangle\langle\phi_k|$$

Though the essential structure of the tuple (cue_1, cue_2, ..., response) is simple enough, it could be exploited in a number of ways, allowing us to extend the scope of the simple CAM/AM based storage/retrieval task either through a spatial or a temporal extension, or a combination of both.

In the sequel, we conjecture some simple schemes.

A. Spatial Extension.

1. Concurrently, obtain input signals $\{x_i,\ 1 \leq i \leq n\}$. These are n different instances of the same measure or different measures of the same feature element. The underlying hypothesis is that all these $x_i$ measures point to essentially the same response vector y.

2. Obtain, in parallel, $(x_i, y_i)$.

3. At the next cycle, at the next higher level, obtain $\eta(y_i)$ for every occurrence of $y_i$. $\eta(.)$ yields the number of occurrences of $y_i$.

4. Obtain $y^* = \arg\max_{y_i}\,\{\eta(y_i)\,f(x_i)\}$ where

$$f(x_i) = \max_{\phi_q}\, fm(x_i|\phi_q)$$

5. Output $(y^* \mid x_1, x_2, \ldots, x_n)$

In this case a concurrent evaluation is proposed after the input stage. At the next layer, a decision logic gate at the output port yields the optimum desired response.
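
The decision stage of the spatial extension can be sketched as a weighted vote. This is only an illustration under stated assumptions: responses are represented by hashable labels, each concurrent retrieval supplies its own best figure of merit, and when several inputs return the same response the largest of their merits is used. The function name `spatial_vote` is hypothetical.

```python
def spatial_vote(candidates):
    """Pick the response with the largest (occurrence count x merit) score.

    candidates: list of (response_label, merit) pairs, one per input signal,
    where merit stands for the best figure of merit found for that input.
    """
    counts, best = {}, {}
    for label, merit in candidates:
        counts[label] = counts.get(label, 0) + 1          # number of occurrences of this response
        best[label] = max(best.get(label, merit), merit)  # strongest supporting merit
    # Score every distinct response and keep the highest-scoring one.
    return max(counts, key=lambda y: counts[y] * best[y])
```

For example, `spatial_vote([("A", 0.6), ("B", 0.9), ("A", 0.5)])` favors "A": two moderately confident retrievals outweigh a single stronger one.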


Obviously, this simple design is robust, distributed and modular. It also provides multiple modular redundancy. In the next proposed extension, we introduce a model which is essentially extensive in the temporal domain.

B. Temporal Extension

1. $i \leftarrow 1$

2. At cycle i, obtain $(x_i, y_i)$

3. While $res(y_i) < \delta$ do

3.1 $i \leftarrow i + 1$

3.2 obtain at the ith cycle

$((x_i, y_i) \mid res(y_{i-1}) < \delta,\ res(y_{i-2}) < \delta,\ \ldots,\ res(y_1) < \delta)$

3.3 compute $res(y_i)$

4. Output $(y_i^* \mid (x_{i-1}, y_{i-1}),\ \ldots,\ (x_1, y_1))$

Here the crucial element is the retrieval process specified in 3.2. At the ith cycle, the scheduled retrieval process is in the WAIT state until it is fired by the appropriate event at the (i-1)th cycle, namely the arrival of the condition $res(y_{i-1}) < \delta$. The two schemes are outlined below.
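
The temporal loop can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `observe` is a hypothetical callback standing in for the retrieval process of step 3.2, and `max_cycles` is an added safeguard against an endless loop that the listing above does not contain.

```python
def temporal_recall(observe, threshold, max_cycles=10):
    """Keep retrieving until the resolution of the latest response is adequate.

    observe(i, history) returns (x_i, y_i, res_i) for cycle i, given the
    (x, y) pairs of all earlier cycles whose resolution was too low.
    """
    history = []
    i = 1
    x, y, res = observe(i, history)
    while res < threshold and i < max_cycles:
        history.append((x, y))           # remember the unresolved pair
        i += 1
        x, y, res = observe(i, history)  # fired only after the previous cycle failed
    return y, history
```

The returned history corresponds to the conditioning pairs in the output step.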


The next level of extension is conjectured on a different basis. In this case, we form memories based on the pair-wise product form of the original basis vectors as shown below. We call this a two-point memory system, a special case of the more general multipoint memory system which we propose shortly.

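$$M_s^{(2)} = \sum_{kl} \left( \alpha_{kl}\,\psi_{kl} - \beta_{kl}\,\psi_{lk} \right)$$

where the direct term is

$$\psi_{kl} = |\phi_k(1)\,\phi_l(2)\rangle\langle x_k(1)\,x_l(2)|$$

and the exchanged term, in which only the basis indices are interchanged, is

$$\psi_{lk} = |\phi_l(1)\,\phi_k(2)\rangle\langle x_k(1)\,x_l(2)|$$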

In this scheme, one argues that

• Logically, the cue-vector $|x_k\rangle$ is associated with the basis vector $\phi_k$, and $|x_l\rangle$ with $\phi_l$, with a joint probability $\alpha_{kl}$.

• However, the closer the cue vectors $|x_k\rangle$ and $|x_l\rangle$ are, the greater is the error one is likely to make in this association. In view of probable misclassification, the weight on the initial proposition must therefore be reduced. This we do by considering the counter proposal that one could have $|x_k\rangle$ associated with $|\phi_l\rangle$ and $|x_l\rangle$ with $|\phi_k\rangle$, respectively, with a probability $\beta_{kl}$ instead. The higher our conviction in this regard (higher $\beta_{kl}$ values), the lower becomes the strength of our original hypothesis.

In this model, the matrix elements assume the forms

$$\langle\phi_m(1)\,\phi_n(2)|M_s^{(2)}|x_p(1)\,x_q(2)\rangle = \alpha_{mn}\,a_{mp}\,a_{nq} - \beta_{nm}\,a_{np}\,a_{mq}$$

Given an arbitrary cue vector x, let us assume that $res(y|x) < \delta$, and that within a threshold limit the cue-vector x is similar to other cue-vectors whose logical associations are, however, different. That is, we suppose an equivalence class associated with x

$$C = \{\,(x_1, y_1 \mid x \approx x_1),\ (x_2, y_2 \mid x \approx x_2),\ \ldots,\ (x_n, y_n \mid x \approx x_n)\,\}$$

Given this, we would like to know how best one infers the corresponding associated vector y with x. Let us denote the matrix element

$$\langle\phi_m(1)\,\phi_n(2)|M_s^{(2)}|x(1)\,x_q(2)\rangle = (m, n, *, q)$$

with $x \approx |x_q\rangle$, where $x_q$ is logically a priori associated with some $y_q$ via the function $\phi_q$. On the assertion that x is associated to some y via the function $\phi_m$, the matrix element (m, n, *, q) becomes a measure of the strength of this hypothesis. Therefore, to obtain the optimum association, we compute

$$\max_{m,n,q}\ (m, n, *, q)$$

and obtain the corresponding $\phi_m$, whence we obtain the most logical association y via

$$M_r|\phi_m\rangle = y$$

One could extend this notion of logical substitutability and reorganize the association space accordingly.

This is carried out next in our three-point memory system model as indicated below.

$$M_s^{(3)} = \sum_{klm} \sum_p P\left\{ (-1)^p\, \alpha_{klm}\, |\phi_k(1)\,\phi_l(2)\,\phi_m(3)\rangle \right\} \langle x_k(1)\,x_l(2)\,x_m(3)|$$

where P is a permutation operator with a parity p.

Note that the right hand side is summed over all permutations. The matrix elements are then

$$\langle\phi_r(1)\,\phi_s(2)\,\phi_t(3)|M_s^{(3)}|x_k(1)\,x_l(2)\,x_m(3)\rangle = \sum_p P\left\{ (-1)^p\, \alpha_{rst}\, a_{rk}\, a_{sl}\, a_{tm} \right\}$$

which then works out to

$$(r,s,t,k,l,m) = \alpha_{rst}\,a_{rk}a_{sl}a_{tm} - \alpha_{srt}\,a_{sk}a_{rl}a_{tm} - \alpha_{rts}\,a_{rk}a_{tl}a_{sm} - \alpha_{tsr}\,a_{tk}a_{sl}a_{rm} + \alpha_{str}\,a_{sk}a_{tl}a_{rm} + \alpha_{trs}\,a_{tk}a_{rl}a_{sm}$$

where (r,s,t,k,l,m) is the matrix element in question. Note that the substitutability rationale could be advanced in the same way as we articulated it earlier in the two-point memory system. If the cue string is found similar to $x_k$, $x_l$ and $x_m$, then the strength of the hypothesis concerned with the a priori association of these cues with $\phi_k$, $\phi_l$ and $\phi_m$ ought to be adjusted in view of the likelihood of false memory assignments via $\phi_k$, $\phi_l$, and $\phi_m$. In our model, we propose that with respect to the original configuration $\phi_k(1)\,\phi_l(2)\,\phi_m(3)$ every other assignment of the form $\phi_{k_1}(1)\,\phi_{k_2}(2)\,\phi_{k_3}(3)$, where $k_1, k_2, k_3$ is some permutation of the indices k, l, m, is either an inhibitory memory association (reduces the strength of our initial hypothesis) or a contributory memory association (advances the strength of our initial hypothesis), depending on the number of transpositions required to transform {k, l, m} to $\{k_1, k_2, k_3\}$. That is, if there are q such transpositions required to bring the string {k, l, m} into $\{k_1, k_2, k_3\}$, then the parity of the new memory assignment is $(-1)^q$, as though every single transposition is tantamount to an adjustment contrary to the direction of the original assignment.
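
The signed sum over permutations can be evaluated directly. The sketch below assumes NumPy, a three-index weight array `alpha`, and a matrix `a` of cue dot products; the function names are hypothetical. When all weights equal one, the sum reduces to a 3 x 3 determinant of dot products, which gives a convenient check.

```python
from itertools import permutations
import numpy as np

def parity(perm):
    """Return +1 for an even permutation of 0..n-1 and -1 for an odd one."""
    sign, seen = 1, [False] * len(perm)
    for start in range(len(perm)):
        length, j = 0, start
        while not seen[j]:               # walk the cycle containing `start`
            seen[j] = True
            j = perm[j]
            length += 1
        if length and length % 2 == 0:   # an even-length cycle is an odd permutation
            sign = -sign
    return sign

def three_point_element(alpha, a, rst, klm):
    """Signed sum over all orderings of the basis indices (r, s, t)."""
    k, l, m = klm
    total = 0.0
    for perm in permutations(range(3)):
        r, s, t = (rst[p] for p in perm)          # permuted basis assignment
        total += parity(perm) * alpha[r, s, t] * a[r, k] * a[s, l] * a[t, m]
    return total
```

Single transpositions enter with a minus sign (inhibitory) and cyclic rearrangements with a plus sign (contributory), matching the six-term expansion.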

We  consider  the  following  diagram  to  illustrate  the  point  that  our  model  suggests. 


Assume that on the energy "hill", at the consecutive local minima, the distinct patterns are stored as shown by the points P, Q, R, etc. Suppose, also, that these patterns are so close to each other that sometimes instead of retrieving the pattern Q from the storage we recall some pattern R. It is pointless to suggest that we sharpen our pattern specification more tightly with a view to minimizing pattern misclassification to some tolerable level. Accordingly, we approach the problem in the following way. Indeed, on some cue, let us assume, it appears that the most probable recall is, say, pattern Q. However, what if we had the patterns stored in the order, say, R Q P S T ... or P Q R T S ... instead of the indicated order of storage on the hill, P Q R S T ...? Would that make any difference? Would we still return the pattern R when we were supposed to return Q instead? Surely, these alternative orders are also probable with some nonzero probabilities, simply because the patterns (R and P) in the first suggested list, and the patterns (T and S) in the second list, are conceivably interchangeable to the extent these few similar patterns are concerned. But, then, why restrict ourselves to single-transposition lists? Surely, if P Q R S T ... is a feasible list, then the list obtained by double transpositions such as (P R) (P T) {P Q R S T ...} = {T Q P S R ...} is also probable. However, we make a somewhat qualified claim now. If the storage order in the first approximation on the "hill" seems to be the list P Q R S T ... coming down the hill, then a storage order implying a p-transposition on the original order ought to be less probable than the one with a q-transposition, if q < p. In other words, in our scheme of approach, the one-point memory model is the optimum first-order AM/CAM model we could think of. Its improvement implies a model in which interactions among patterns cannot be ignored any more. The two-point memory system, and then the three-point memory system, are then the second-order and the third-order approximations, as indicated, with more and more pattern interactions taken into account.

Therefore, one could suggest a plausible model in which we carry out storage and retrieval of patterns as follows. We assume that AM/CAM memory need not always remain static. We, instead, suggest that such a memory should be dynamically restructured based on how successful one continues to be with the process of recall. We assume the memory system to be in state $\gamma$, that is, its current memory is so densely packed that at time t it is a $\gamma$-point memory at some level p (p is an integer less than or equal to $\gamma$) as indicated below

$$M_\gamma^{(p)} = \sum_{klm \ldots q} \sum_p P\left\{ (-1)^p\, \alpha_{klm \ldots q}\, |\phi_k(1)\,\phi_l(2)\,\phi_m(3) \cdots \phi_q(n_\gamma)\rangle \right\} \langle x_k(1)\,x_l(2)\,x_m(3) \cdots x_q(n_\gamma)|$$

If recall of similar patterns in this memory gives us a hit ratio less than some threshold parameter $\rho_\gamma$, then we should restructure the memory as a next-level element and continue using it. This, probably, is what does tend to happen in using biological memory. Whenever we tend to face difficulties in recalling stored information (with a presumed similarity with other information strings), perhaps even consciously at times, we restructure our memory elements, focusing on other nontrivial features of the patterns not considered before.

Accordingly, one could approach the storage/retrieval policy as follows. Assume that the system is at state $\gamma$, i.e. its memory is at most $M_\gamma^{(p)}$. Given an incoming pattern x, the system tries to return a vector y using the simplest memory $M_\gamma^{(1)}$, in which interactions among the stored patterns are ignored. If, within some threshold parameter $\theta_1$, it finds that it could, instead, return any one of the vectors in the set $S_1 = \{z_1^1, z_2^1, \ldots\}$, then it considers its memory to be logically $M_\gamma^{(2)}$ and attempts to return $y^*$ with less uncertainty. If, however, it finds that, within some threshold parameter $\theta_2$, it could, instead, return any one of the vectors in the set $S_2 = \{z_1^2, z_2^2, \ldots\}$, then it considers its memory organized as $M_\gamma^{(3)}$ and continues with the process. We assume that, while the system is at state $\gamma$, the memory could be allowed to evolve sequentially at most to its highest level $M_\gamma^{(p)}$.

What if the system fails to resolve an input cue vector even at the memory level $M_\gamma^{(p)}$? Then, we could reject that pattern as irresolvable and continue with a fresh input. The system could be designed to migrate from the state $\gamma$ to the state $\gamma + 1$ if the number of rejects in the irresolvable class mounts up rapidly. Otherwise, we leave it at the state $\gamma$, and on each fresh cue we let the system work through $M_\gamma^{(1)} \rightarrow M_\gamma^{(2)} \rightarrow \cdots \rightarrow M_\gamma^{(p)}$ as need be.
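
The level-by-level policy can be sketched as a simple escalation loop. This is only an illustration under stated assumptions: each memory level is modelled as a hypothetical callable returning a response together with an ambiguity score, and one ambiguity limit per level plays the role of the threshold parameters.

```python
def escalate(cue, levels, thresholds):
    """Work through the memory levels until one resolves the cue.

    levels: callables, level(cue) -> (response, ambiguity), ordered from the
    interaction-free memory up to the highest level available in this state.
    thresholds: one ambiguity limit per level.
    Returns (response, level_index), or (None, None) if the cue is rejected.
    """
    for index, (level, limit) in enumerate(zip(levels, thresholds)):
        response, ambiguity = level(cue)
        if ambiguity <= limit:       # resolved well enough at this level
            return response, index
    return None, None                # irresolvable: the caller counts the reject
```

A caller could count the `(None, None)` results and move the system to the next state when rejects accumulate quickly.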
Models of Adaptive Neural-net based
Pattern Classifiers/Recognizers

A neural network based system, implemented as a collection of elemental processors working cooperatively in a functional setting in specific problem domains (like controlling an unknown dynamical system autonomously), is ultimately a learning entity that evolves through some form of self-organization. Learning is mediated via recurrent sampling on an open event domain based on the paradigm of "learning by examples", whereby the pairwise connections among the "neurons", the synaptic weights, asymptotically tend to approach a globally stable equilibrium state on the assumption that the relevant training set itself remains stable during the learning phase. In this framework, an ideal neural-network type system is functionally equivalent to an adaptive pattern recognizer, whether it is designed to function as a front-end image-compressor system via an appropriate vector quantization of images, or as a system that yields optimum or almost optimum tours of Traveling Salesman problems. It is in this regard that neural network models have been most extensively studied: as adaptive pattern classifiers/recognizers, as systems that could autonomously learn and function in an unknown problem environment.

Accordingly, it is this specific area of clustering and pattern recognition in which the major research thrust continues today (4) in one form or another. The problem here is primarily twofold: (a) identification of an appropriate learning (or mapping) algorithm, and (b), given (a), identification of an implementable architecture that reflects a functionally parallel distributed system and supports such a system at a high performance level. A neural-network type system emerges as a powerful high-performance machine because it is inherently concurrent at instruction and at task levels; any compromise with this specific feature would surely lead to a precipitous drop in its performance. Given this imperative, one could alternatively approach the problem from the opposite angle: obtain an appropriate learning algorithm given that the processing architecture must be parallel-distributed with a maximum exploitation of system level concurrency.

The question of validity of neural network models is no longer the issue. The pertinent issue now is how to implement what need be implemented, i.e. the realization of a feasible and promising learning algorithm, and the attendant system level architecture. In this paper, the framework is set on the assumption that learning categories of patterns by an unsupervised collection of automata in a distributed computation mode is where the research interest is focused. Accordingly, this is the area we concentrate on in this paper.

Patterns  over  Feature  space 

Objects of interest, through appropriate feature extraction and encoding, are conveniently mapped onto patterns with binary-valued or analog features. Inversely, patterns over feature vectors could be made associative with the corresponding objects, which constitutes the notion of equivalence classes in the following sense

• A pattern $x_i = (x_{1i}, x_{2i}, \ldots)$ is in class $c_\lambda$ if the corresponding class-exemplar $x_\lambda$ is near to $x_i$ within a tolerance distance (or an uncertainty) of $\delta_\lambda$, i.e.

$$d(x_i, x_\lambda) \leq \delta_\lambda$$

where,

$$\lambda = \mathrm{index}\left(\min_{\lambda'}\,\{d(x_i, x_{\lambda'})\}\right)$$

where d(.,.) is some suitable distance measure on the metric space spanned by the pattern vectors.
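
The class-membership rule can be sketched as a nearest-exemplar test with a per-class tolerance. This is a minimal illustration assuming NumPy and the Euclidean distance as the "suitable distance measure"; the function name `classify` is hypothetical.

```python
import numpy as np

def classify(pattern, exemplars, tolerances):
    """Return the index of the nearest exemplar, or None if the pattern
    lies outside that exemplar's tolerance distance.

    Assumption: Euclidean distance is used as d(., .).
    """
    pattern = np.asarray(pattern, dtype=float)
    exemplars = np.asarray(exemplars, dtype=float)
    dists = np.linalg.norm(exemplars - pattern, axis=1)  # distance to every exemplar
    lam = int(np.argmin(dists))                          # index of the minimum distance
    return lam if dists[lam] <= tolerances[lam] else None
```

With exemplars at (0, 0) and (10, 10) and tolerances of 1.5, the pattern (1, 0) falls in class 0, while (5, 5) belongs to neither class.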

In an unsupervised environment, the emerging class-exemplars or the cluster prototypes in a multiclass domain could be made to obey some additional constraints. Assuming that we do not a priori know their distributions except the requirement that their mutual separation must be at least as large as some threshold parameter, one could require that

$$d(x_\lambda, x_{\lambda'}) \geq \rho_{\min}$$

The idea is that at some distance $\rho_{\min}$ or lower a pair of clusters may lose their individual distinctiveness so much so that one could merge the two to form a single cluster. Precisely how $\rho_{\min}$ ought to be stated is a debatable issue, but it could be related to the inter-cluster distance (ICD) distribution, which we may conjecture to be a normal distribution with a mean $\mu$ and a variance $\sigma^2$. Accordingly, we could let

$$\rho_{\min} = \mu + p\,\sigma$$

where p is some number around 1.0, and $\sigma^2$ is the variance of the ICD distribution $N(\mu, \sigma^2)$, assumed to be small. The problem is that we usually do not know this distribution a priori. For an unknown pattern domain, it may not even be possible to predict a priori into how many distinct pattern classes the pattern space should be partitioned. All we know is that the more separated the equivalence classes are, the more accurate one would be in ascertaining that a specific pattern belongs to a specific class.

Ideally, clusters ought to be well separated on the feature plane where the patterns are recognized. This is expressed by a high value of p. This could be achieved by designing a feature space in which small differences at individual pattern levels could be captured. This is a design issue, an encoding problem which could be separately tackled to an arbitrary degree of performance. This issue would be addressed in our subsequent papers. However, there is another problem which we should look into. This arises in situations where the order of input matters. In a functional system, we do want the pattern classes to be tight, i.e. we want the intra-cluster distance measures to be as small as possible so that we could have crisp clusters and retain them as such. This means that the variance measures on the exemplars, or the average coefficient of variation of an exemplar, ought to be small. But then, is this, or should it ever be, under our control? The pattern sampling that takes place during the training period may include event instances so noisy and incomplete that for all practical purposes they could be considered garbage. Should we let the system be trained with such feature vectors? One could suggest that if, on inclusion of such a feature vector, the class prototype is perturbed more than one could tolerate, then one could abandon the vector altogether. However, this means that, to be reasonable, one should make the level of tolerance a function of the number of patterns already accommodated within the class.

Some  Neural  Net  Classifier  Models 

In this section, we conjecture some models which are obvious extrapolations of some neural-net based models already advocated. One could consider a known classifier model and expand it to suit a specific architecture, in which case one starts with the advantage that at least results up to the pre-extrapolated modeling state are known and could be considered as a reference basis for further studies. However, we have some problems in this regard. It is possible that

• without an extensive simulation and actual real case studies, one may not be in a position to confidently use these extrapolated models, even though they do appear fairly plausible at this stage,

• no neural net model has been so extensively studied as to understand the extent to which the model is applicable under realistic noise and measurement oriented issues,

• no research result is available to indicate to what degree the classification/recognition problem, in the domain of the neural net paradigm, is sensitive to pattern symmetries, particularly when one deliberately expands the feature space by systematic topological operations such as folding, contraction, etc. Note that such topological operations do not, in general, reduce the inherent entropy measures of the patterns studied, but they could provide some redundancy on the metric space which could be exploited. It is not necessarily obvious that such a topological restructuring is always possible or even desirable for feature enhancement.

Accordingly, we propose the following. Of course, this has to be thoroughly tested, particularly with highly symmetric patterns and with those contaminated with noise, but we could extend the feature space from an orthogonal $\nu$-dimensional feature space to a higher-dimensional general feature space contemplated as follows. Given a pattern on the original vector space as a tuple $\{x_i\}$, we consider it equivalent to the nonorthogonal expansion in the following form

Original pattern $P = \{x_i\}$

Equivalent pattern on the extended space, $P_{equiv} = \{x_i;\ x_i x_j,\ i \neq j;\ x_i x_j x_k,\ i \neq j \neq k;\ \ldots\}$.

For classification/recognition purposes, we assume $P_{equiv} \equiv P$. The additional redundancy introduced by this artificial encoding would not be detrimental; on the contrary, it would assist us to discriminate patterns successfully in a parallel-distributed processing scheme.
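
The extended encoding can be sketched as a product-feature expansion. This is only an illustration: it assumes that each product of distinct features is listed once (unordered index combinations) and that the expansion is cut off at a chosen order; both `expand_pattern` and `max_order` are hypothetical names.

```python
from itertools import combinations
from math import prod

def expand_pattern(x, max_order=3):
    """Return the single features, then all pairwise products of distinct
    features, then all triple products, and so on up to max_order."""
    features = []
    for order in range(1, max_order + 1):
        for idx in combinations(range(len(x)), order):   # distinct indices only
            features.append(prod(x[i] for i in idx))
    return features
```

For a binary pattern such as `[1, 0, 1]`, the expansion adds the conjunctions of feature pairs and triples, so the extended vector carries redundant information about which features co-occur.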

A Localized-Distributed ART1 Scheme:

In this scheme, the basic ART1 model of Carpenter & Grossberg (1,5), which deals with binary feature vectors, is considered as the reference system for developing a workable localized-distributed model for feature recognition/classification. This is outlined as follows.


(a) Given feature vectors x on the original $\nu$-dimensional basis, we obtain, for each pattern, the extended, but equivalent, feature vector $\sigma$ as indicated earlier. The entire $\sigma$-pattern is input to a group of processing unit modules, each of which comprises a number of neuron-type processors. Thus, logically, the pattern vector $\sigma$ is captured by, say, N PUMs (processing unit modules); each PUM, comprising, say, n neuron-type simple processor elements, receives, at time t, a specific partition of the $\sigma$ pattern for processing. Thus, the Ith PUM would receive, at time t, the following ordered input string for processing

$$\sigma_I(t) = \{\sigma_{I1},\ \sigma_{I2},\ \ldots,\ \sigma_{In}\}$$

This is as shown below.


(b) For each Ith PUM, we have the local group vigilance parameter $\rho_I$, which is a variable. For each pattern class or exemplar j, similarly, we have a separation measure $\pi_j$ reflecting the minimum tolerable distance the other classes must be at given the pattern class j. Initially,

Pi  =  p*  -r  [1  -  \\[pr  -  Po,  0) 


and

$$\pi_j > 0$$

where u(x,y) is a uniformly distributed random variable between x and y, both inclusive, and f(m) is some appropriate monotonically increasing function of the PUM size m, the number of elemental neuron-type processors it contains. Also,

p*  =  nun  {pi},  and  Pc  =  {pi} 

(c) Each PUM, working independently of the others, attempts to identify the best exemplar its input pattern corresponds to. We assume that there are n + 1 distinct classes to choose from. One of these, the refuse class, would correspond to those patterns which are too fuzzy to belong to any of the other n classes. The exemplar weights are stored in the b-matrix. The dot products of its input pattern with the exemplar patterns are computed at this level as

=  E 


where the index i runs over all neuron-type processors comprising the Ith PUM, and the index j points to a specific class or pattern exemplar.

(d)  The  tentative  cluster  I(j*)  for  the  Ith  PUM  conglomerate  is  obtained 
as 


I(j  )  =  index  (max 
j 

If  is  less  than  <5j’  then  the  pattern  goes  to  refuse  class, 

(e) For each PUM, we obtain $\alpha_I$ as given by


ai 


I 


I 


(f) If $\alpha_I < \rho_I$, then we deactivate the suggested class I(j*), as shown by Carpenter & Grossberg, and go back to (c). Otherwise, at the earliest possible time, we post I(j*) at the next level up with its $(\alpha_I - \rho_I)$ value.


(g) We next compute the frequency of occurrence m(j*) of the exemplar j* as the indicated pattern choice at the subpattern levels, and obtain

$\beta(j^*) = m(I(j^*)) \sum_I \frac{(\alpha_I - \rho_I)}{\rho_I}\, \mathrm{Thr}(\alpha_I - \rho_I)$

where Thr(x) = 1 if x > 0.0, otherwise 0.

(h) The most suggested cluster center is j** where

$j^{**} = \mathrm{index}\,(\max_j \beta(j))$


(i) Synaptic weights (both the top-down and the bottom-up weights) are updated if $\beta(j^{**}) > \gamma$, some threshold parameter. Then, referring to time instances by the label l, we have


tV-i(i+i) 

bV*i(i+i) 


=  tY-i(l) 

tV-i  (1+1) 
0.5  -h  2  tV*i  (I) 


(j) If the weights are updated via (i), then on all PUM nodes I with I(j**) as their choice, we update

$\rho_I(l+1) = \mathrm{avg}\,\{\rho_I(l)\}$

else, we update them as

$\rho_I(l+1) = \max\,\{\rho^*, \rho_I(l)\}$

The algorithm as outlined here is an extension of the basic ART1 algorithm of Carpenter and Grossberg (3). The idea is to discover the global cluster structure via cluster formation at the local level. Any such algorithm is of immense help, since any pattern structure could, in principle, be partitioned among a number of parallel processor unit modules, each of which would then be required to identify locally, at its level, whatever distinguishing pattern it can see.
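Steps (c) through (h) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the report's implementation: the function names and the binary exemplar layout are hypothetical, and the match ratio used for $\alpha_I$ (matched ones divided by input ones, as in standard ART1) is assumed because the defining expression is not legible in the source. The weight update of step (i) is omitted.

```python
def module_vote(sub, exemplars, rho):
    """One PUM: return (class index, alpha - rho), or None for the refuse class."""
    active = set(range(len(exemplars)))
    ones = sum(sub)
    while active and ones:
        # steps (c)/(d): class with the largest dot product among active classes
        j = max(active, key=lambda c: sum(b * s for b, s in zip(exemplars[c], sub)))
        # step (e): ASSUMED ART1-style match ratio standing in for alpha_I
        alpha = sum(b * s for b, s in zip(exemplars[j], sub)) / ones
        if alpha >= rho:
            return j, alpha - rho
        active.discard(j)  # step (f): deactivate the class and search again
    return None


def global_choice(pattern, exemplar_parts, rhos):
    """Steps (g)/(h): combine the module votes into the most suggested class.

    exemplar_parts[I][j] is class j's exemplar restricted to module I.
    """
    n = len(pattern) // len(rhos)  # elements handled by each module
    votes = {}
    for I, rho in enumerate(rhos):
        result = module_vote(pattern[I * n:(I + 1) * n], exemplar_parts[I], rho)
        if result is None:
            continue
        j, margin = result
        count, total = votes.get(j, (0, 0.0))
        if margin > 0:  # Thr(alpha - rho)
            total += margin / rho
        votes[j] = (count + 1, total)
    # beta(j) = (number of modules voting for j) * (sum of normalized margins)
    beta = {j: count * total for j, (count, total) in votes.items()}
    return max(beta, key=beta.get) if beta else None
```

Each module decides locally, and only the class index and its margin travel to the next level, which is the point of the localized-distributed arrangement.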

A  Bland-net  Classifier: 


This approach, though it lacks depth, is useful as an alternative cluster determination procedure. Although it is presented as a centralized scheme, it could easily be restructured as a localized-distributed scheme, as outlined in the previous section for the ART1-type algorithm. Its basic architecture is a single level, or single layer, of multiple neuron-type processors processing the input string, with a decision unit sitting on top and working on the output from the neurons below. This is outlined below.

(a) As in the localized-distributed ART1 scheme, extend the input feature vector $\vec{x}$ to a feature vector $\vec{\sigma}$. The input vector is now the ordered string

$\sigma(t) = \{\sigma_1, \sigma_2, \ldots\}$


(b) Assign random weights to the synaptic elements $b_{ji}$. Note that the elements $b_{ji} \in [0,1]$.




A  Bland-net  System 


(c) On an input pattern to be classified, compute pattern-pattern class and pattern class-pattern class distances over the Mahalanobis metric with a positive value for the $\lambda$ parameter

$D_j = \left[ \sum_i (b_{ji} - \sigma_i)^{\lambda} \right]^{1/\lambda}$

$Q_j = \min_{k \neq j} \left[ \sum_i (b_{ji} - b_{ki})^{\lambda} \right]^{1/\lambda}$

(d)  Compute  j*  =  index  (min  {Dj}).  If  Dj*  <  6,  then  obtain 

h 

Qj 

(e)  Update  synaptic  weights  bj*i  if  the  last  pattern  is  included  in  this  class 
as 


j*  =  index  (min  { 
j 


}).  Pattern  a  is  in  class  j  . 


»i-i  ( ■'i-  + 1)  = 


where $n_{j^*}$ is the number of patterns already accommodated in the cluster j*.
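The classify-then-update cycle of steps (c) through (e) might look like the sketch below. Several parts are assumptions made for illustration only: the acceptance rule is reduced to "nearest exemplar within a threshold `delta`", and the weight update is taken to be a running mean over the $n_{j^*}$ patterns already in the cluster, since the update equation itself is not legible in the source. All names are hypothetical.

```python
def distance(u, v, lam=2.0):
    """Distance of the form [sum_i |u_i - v_i|^lam]^(1/lam)."""
    return sum(abs(a - b) ** lam for a, b in zip(u, v)) ** (1.0 / lam)


def classify_and_update(sigma, exemplars, counts, delta, lam=2.0):
    """Assign sigma to its nearest exemplar and fold it into that exemplar.

    Returns the chosen class index, or None if no exemplar is within delta.
    exemplars and counts are modified in place.
    """
    d = [distance(b, sigma, lam) for b in exemplars]
    j = min(range(len(d)), key=d.__getitem__)
    if d[j] >= delta:
        return None
    n = counts[j]
    # ASSUMED update: running mean of the patterns accommodated so far
    exemplars[j] = [(n * b + s) / (n + 1) for b, s in zip(exemplars[j], sigma)]
    counts[j] = n + 1
    return j
```

With a running mean, an exemplar drifts less as its cluster grows, which is one simple way to keep early patterns from being overwritten by later ones.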


Classifier Architectures

Training a system for the eventual task of correct pattern-class identification on a pattern space may be order-sensitive in some cases. This depends upon the algorithm used to identify clusters, and on the way the training patterns are introduced at the input and subsequently processed by the system. If the system at the task level is essentially sequential, in the sense that the individual patterns are admitted and processed one at a time against the already acquired (though tentative) knowledge of the likely class distribution, which in turn is incrementally adjusted given the current new input, we could eventually get an order-sensitive distribution. Ideally, the input patterns, in their entirety, should be processed concurrently as one single whole without any prior reference to any distribution, so that after a given number of cycles we would obtain a stable, crisp, unbrittle distribution of the sampling population significantly close to the population distribution in the universe of discourse. But if the patterns arriving at an input port are only sequentially admitted as units of a single time-dependent or time-varying input stream, we cannot arbitrarily keep them on hold in some temporary buffers to be released at some convenient time for concurrent processing in bulk, unless the input rate is very high compared to the processing rate.

To understand the overall dimension of the problem in this regard, we consider some simple processing models. Suppose the patterns arrive at an input port at a rate of $\lambda_p$ patterns/sec, obeying a Poisson distribution as follows. The probability that the number of input patterns arriving during an arbitrary interval of length t units is n is given by

$P(\text{number of inputs} = n \text{ within } t) = \frac{(\lambda_p t)^n}{n!}\, e^{-\lambda_p t}$
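As a quick numerical check of the arrival model, the Poisson probability can be evaluated directly. The function name and the sample rate below are illustrative choices, not values from the report.

```python
from math import exp, factorial


def poisson_pmf(n, rate, t):
    """Probability of exactly n arrivals in an interval of length t."""
    mean = rate * t  # expected number of arrivals in the interval
    return mean ** n * exp(-mean) / factorial(n)


# Example: chance of an empty one-second interval at 2 patterns/sec
p_empty = poisson_pmf(0, 2.0, 1.0)
```

Summing `poisson_pmf(n, rate, t)` over all n gives 1, which is a convenient sanity check when choosing a buffer size.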


Suppose, in the first model, we have all the incoming patterns temporarily queued in a buffer (infinitely long) until they are serviced one at a time in a FIFO discipline by an appropriate classifier (using, say, the ART1 algorithm of Carpenter & Grossberg (3,5)) in which the average individual pattern processing time is $1/\mu_c$ sec. We assume that the classification time is exponentially distributed, that is, the probability density of the classification/recognition time is given by

$f(s) = \mu_c\, e^{-\mu_c s}, \quad s > 0$
$\phantom{f(s)} = 0, \quad \text{otherwise}$


In this case, the relevant system and pattern traffic aggregate measures emerge as follows (2).

$L^{(1)}$, the average buffer occupancy $= \frac{\lambda_p}{\mu_c - \lambda_p}$

$W^{(1)}$, the average pattern-residence time in the system $= \frac{1}{\mu_c - \lambda_p}$

$W_q^{(1)}$, the average pattern time in the buffer $= \frac{\lambda_p}{\mu_c(\mu_c - \lambda_p)}$

We should strive for a system in which $L^{(1)}$, $W^{(1)}$, and $W_q^{(1)}$ are all small.
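The three single-classifier measures are the standard M/M/1 results and can be computed as below. The function name and the sample rates are illustrative assumptions.

```python
def mm1_measures(lam, mu):
    """Return (L, W, Wq) for one classifier fed by one FIFO buffer.

    lam: arrival rate (patterns/sec); mu: service rate (patterns/sec).
    """
    if not 0 < lam < mu:
        raise ValueError("need 0 < lam < mu for a stable queue")
    occupancy = lam / (mu - lam)            # average number in the system
    residence = 1.0 / (mu - lam)            # average time in the system
    waiting = lam / (mu * (mu - lam))       # average time in the buffer
    return occupancy, residence, waiting


L1, W1, Wq1 = mm1_measures(2.0, 5.0)
```

All three quantities grow without bound as the arrival rate approaches the service rate, which is why a single sequential classifier is fragile under heavy traffic.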

We could improve the system somewhat if we process the patterns concurrently by a set of multiple classifiers. We assume that they are all equal in performance, i.e. capable of classifying a single pattern in $1/\mu_c$ time on the average. There are at least two distinct ways to process the patterns now.

Processor  Pool  Model  A: 

We assume that the system, in this case, comprises k identical processors in a processor pool. All the incoming patterns wait in a single FIFO queue served by the k processors in the M/M/k service mode. Whenever one of the k machines becomes idle, it picks up the first pattern it finds at the input queue and attempts to classify it. The system organization is indicated below.


A  Processor  Pool  based 
Classifier  Model 


Given that the arrival rate of the patterns into the system is still $\lambda_p$ patterns/sec, that the service process at any of the k given classifiers is also Poisson with a mean rate of $\mu_c$, and that $\lambda_p < k\mu_c$, the equilibrium probability $p_0$ that the system is idle is given by

$p_0 = \left[ \sum_{i=0}^{k-1} \frac{1}{i!} \left( \frac{\lambda_p}{\mu_c} \right)^i + \frac{1}{k!} \left( \frac{\lambda_p}{\mu_c} \right)^k \frac{k\mu_c}{k\mu_c - \lambda_p} \right]^{-1}$

and the equilibrium probability $p_i$ that the system has i patterns in it is given by

$p_i = \frac{1}{i!} \left( \frac{\lambda_p}{\mu_c} \right)^i p_0$ for i < k

in terms of which the system aggregates could be computed as

$L_q = \frac{(\lambda_p/\mu_c)^k\, \lambda_p\, \mu_c}{(k-1)!\,(k\mu_c - \lambda_p)^2}\; p_0$

where $L_q$ is the average number of patterns in the processing queue waiting for one of the k processors to process them, and

$W^{A} = \frac{L_q}{\lambda_p} + \frac{1}{\mu_c}$, and

$L^{A} = \lambda_p W^{A}$

where $W^{A}$ is the average residence time per pattern in the system, and $L^{A}$ is the system occupancy in model A.
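The processor-pool quantities can be evaluated with the standard M/M/k formulas. The sketch below is illustrative; the function name and dictionary keys are assumptions, not notation from the report.

```python
from math import factorial


def mmk_measures(lam, mu, k):
    """Idle probability, queue length, residence time and occupancy for M/M/k."""
    if lam <= 0 or lam >= k * mu:
        raise ValueError("need 0 < lam < k * mu for a stable queue")
    a = lam / mu  # offered load
    p0 = 1.0 / (
        sum(a ** i / factorial(i) for i in range(k))
        + a ** k / factorial(k) * (k * mu) / (k * mu - lam)
    )
    lq = p0 * a ** k * lam * mu / (factorial(k - 1) * (k * mu - lam) ** 2)
    w = lq / lam + 1.0 / mu   # waiting time in queue plus one service time
    return {"p0": p0, "Lq": lq, "W": w, "L": lam * w}
```

Setting `k = 1` reduces these expressions to the single-classifier case, which is a useful consistency check.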

k-Parallel Stream Model B:

In this case, the input stream is split into k streams, each of which is then looked after by a single processor with the same processing power as before, completing on the average one pattern in every $1/\mu_c$ second. This is shown schematically below.



Parallel  Stream  or  SIMD  based  design 

At each arriving pattern stream, we assume, the interarrival time between any two consecutive patterns would be exponentially distributed with a mean of $k/\lambda_p$, since upon being available for processing a pattern would be sent down to one of the processor queues with a probability of 1/k. In this case, the system aggregates per stream are as follows

$L^{B} = \frac{\lambda_p/k}{\mu_c - \lambda_p/k} = \frac{\lambda_p}{k\mu_c - \lambda_p}$

with the average residence time per pattern as

$W^{B} = \frac{1}{\mu_c - \lambda_p/k} = \frac{k}{k\mu_c - \lambda_p}$

In either case, whether we provide a bank of multiple servers for a single-queue pattern traffic or a multiple-port service with each input port being serviced by a single processor, the net result is a reduction of residence times, since the ratios of both residence times to that of the single-classifier system are less than 1 for k > 1.
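The claimed reduction can be checked numerically for the split-stream design, treating each stream as an independent single-server queue that receives 1/k of the traffic. The function names and sample rates are illustrative assumptions.

```python
def residence_single(lam, mu):
    """Average residence time with one classifier and one queue."""
    return 1.0 / (mu - lam)


def residence_split(lam, mu, k):
    """Average residence time when traffic is split evenly over k streams."""
    return 1.0 / (mu - lam / k)


# Ratio below 1 means the split design keeps patterns in the system for less time
ratio = residence_split(4.0, 5.0, 2) / residence_single(4.0, 5.0)
```

The ratio equals $k(\mu_c - \lambda_p)/(k\mu_c - \lambda_p)$, so the gain is largest when the single classifier is close to saturation.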


A  Quasi-Array  Processor  Model: 

Here, we assume the system to be in one of two mutually exclusive states. It is either in a learning state A, in which it trains itself incrementally with a stream of incoming feature vectors on the assumption that these vectors constitute a set of legitimate patterns from the population of patterns in a statistical sense, or it is in state B, where its learning has stabilized to the extent of knowing precisely into how many pattern classes the pattern domain is to be partitioned and also, to a large extent, what the corresponding pattern exemplars look like.

Assuming the system to be in state A, that is, in the learning state, we let the incoming feature vectors be collated in a time-ordered sequence at the input buffer B serviced by a controller. The controller periodically picks up k feature vectors from the buffer and has them shuffled at a permutation block P, such that the input string of vectors $\{\vec{x}(t), \vec{x}(t+1), \ldots, \vec{x}(t+k-1)\}$ is changed to a random sequence of vectors of the form $\{\vec{x}(t_1), \vec{x}(t_2), \ldots, \vec{x}(t_k)\}$ at the output of P. The controller then dispatches these vectors to the classifiers at a rate of one vector per classifier per dispatch event. The classifier bank comprises k independent classifying units. Each classifier $C_i$, upon receiving the vector $\vec{x}(t_i)$, attempts to learn from it using one of the algorithms outlined. After learning/classifying m consecutive patterns, each classifier $C_i$ sends its exemplar profile map to its output node $O_i$, where the output exemplar profile corresponds to the pattern-class distribution for the last m independent feature vector events. The output profile at $O_i$ is a string of vectors along with their pattern counts, i.e. the number of patterns it has accommodated in a given class, as shown below.

$\psi_i(l) = \{(Z^i_1, m^i_1), (Z^i_2, m^i_2), \ldots, (Z^i_h, m^i_h)\}$

with $m^i_j$ = the cardinality of the class j given m patterns in total

indicating that at the end of m consecutive pattern processings the plausible partition of the feature space is an h-tuple as shown above. The local output node $O_i$ sends this feature map to the next level up at the node Y, where all such k independent feature maps arising from below are consolidated with the consolidated feature map obtained at the last cycle.

If, on the other hand, the system is in state B, that is, functioning as a pattern recognizer, the controller sends the input pattern to be recognized directly to the processor Y in an M/M/c processing mode, assuming that the feature vector arrival process at the input buffer is Poisson and that the pattern recognition time at Y is exponentially distributed. The service times at the processors $C_i$ are all assumed to be exponentially distributed.

We assume that the switch from state A to state B takes place once the system is triggered by one or more of the following conditions.

•  The input stream is punctuated by a time gap. After the time-out, the system migrates into the state B.

•  The class-exemplar profiles at Y over the last q updates involve exemplar-pattern perturbations whose maximum is lower than some threshold parameter.

Note that at any sampling event time during or after training episodes, after some minimal time point $t_q$, the node Y has a specific exemplar profile. Even if one or more classifier-type processors fail, the system remains globally fault-tolerant. The basic structure is indicated below.


Note that each elemental concurrent processor $C_i$ could be, at some level, a feedback processor, as shown in the diagram, in the event the pattern class corresponding to an input pattern requires further processing.

Consolidation  of  local  distributions 
into  a  global  pattern-class  distribution 

At the node Y, the global pattern classification emerges asymptotically. That is, if the class-distribution profile at Y at time t = l is given by the distribution $\Phi_Y(t = l)$, then $\lim_{t \to \infty} \Phi_Y(t) = \Phi_Y$, we assume, does exist and constitutes the desired class-exemplar profiles. We assume that at the end of some computational cycle l, the individual class-distribution profiles together with their local frequency of distribution $\{\psi_i(l),\ 1 \le i \le k\}$ are amassed from the output nodes $\{O_i\}$ below. All these are reconciled, at the end of the lth cycle, with the previously computed global class-distribution $\Phi_Y(l-1)$ at Y as follows.

$\Phi_Y(1) = \bigcup_i \psi_i(1)$

$\Phi_Y(l) = \Phi_Y(l-1) \cup \left( \bigcup_i \psi_i(l) \right)$

and, given two class-distribution profiles


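$\psi_i(l) = \{(Z^i_1, m^i_1), (Z^i_2, m^i_2), \ldots, (Z^i_h, m^i_h)\}$
$\psi_j(l) = \{(Z^j_1, m^j_1), (Z^j_2, m^j_2), \ldots, (Z^j_h, m^j_h)\}$

we define the "union" operation of the above profiles as

$\psi_i(l) \cup \psi_j(l) = \left\{ \left( \frac{m^i_q Z^i_q + m^j_r Z^j_r}{m^i_q + m^j_r},\ (m^i_q + m^j_r) \right) \;\middle|\; d(Z^i_q, Z^j_r) < \epsilon \right\}$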

where,  d(.,.)  is  the  "distance"  between  the  exemplar  patterns  in  the  argument.  One  could, 
if  necessary,  condition  the  union  operation  further  by  requiring  that  two  classes  may  not 
be  allowed  to  coalesce  if  the  resultant  class  were  to  emerge  as  too  big  a  class  by  size. 
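The union operation can be sketched as a count-weighted merge of exemplars. The names `merge_profiles` and `eps` are hypothetical, the distance is taken to be Euclidean for illustration, and keeping unmatched exemplars as separate classes is an assumption: the source only defines what happens to pairs that lie within the threshold.

```python
from math import dist


def merge_profiles(p, q, eps):
    """Merge two profiles, each a list of (exemplar vector, pattern count).

    Exemplars closer than eps coalesce into their count-weighted mean.
    """
    merged = [(list(z), m) for z, m in p]
    for z, m in q:
        for idx, (zc, mc) in enumerate(merged):
            if dist(zc, z) < eps:
                total = mc + m
                mean = [(mc * a + m * b) / total for a, b in zip(zc, z)]
                merged[idx] = (mean, total)
                break
        else:
            # ASSUMPTION: an exemplar with no close partner stays a class of its own
            merged.append((list(z), m))
    return merged
```

Weighting by the pattern counts means a well-populated class moves little when merged with a sparsely populated one, so the consolidated profile at Y changes gradually from cycle to cycle.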


Performance  Profile 


The above model, as proposed, attempts to provide a "bias-free" architecture via concurrent multistream computation of the global class-distribution. If the patterns, or rather the feature vectors, arrive into the system during the training period at a Poisson rate of $\lambda_p$ vectors/sec, each processing stream then receives input at a rate of $\lambda_p/k$ vectors/sec, which gets processed with an average residence time of W sec. per pattern, where


where $\mu_Y$ is the average processing rate of feature vectors at the level Y.

Note that, in some cases, particularly in Carpenter/Grossberg-type algorithms, the internal processing by the elemental processor $C_i$ may include a feedback loop as shown below. This is because a tentative class exemplar may sometimes fail to resonate at a specific class, in which case one has to deactivate the class temporarily and search for an alternative choice by going back into the algorithm at least one more time.


An Internal Classifier with a Feedback Loop

Assuming that with a probability $\theta$ a task returns to the end of the buffer for at least one more round of processing, and with a probability of $(1 - \theta)$ it departs the classifier, one obtains, at equilibrium,

$\frac{\lambda_p}{k}\, p_0 = (1-\theta)\, \mu_c\, p_1$

$\left( \frac{\lambda_p}{k} + (1-\theta)\mu_c \right) p_n = \frac{\lambda_p}{k}\, p_{n-1} + (1-\theta)\mu_c\, p_{n+1}, \quad n \ge 1$

where $p_n$ is the equilibrium or steady-state probability that the subsystem at the $C_i$ processor stream has n tasks in it (including one being processed, if any). The equilibrium population densities come out to be


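$p_n = \left( \frac{\lambda_p}{k(1-\theta)\mu_c} \right)^n \left( 1 - \frac{\lambda_p}{k(1-\theta)\mu_c} \right), \quad n \ge 0$

and the steady-state system occupancy comes out as

$L_i = \sum_{n=0}^{\infty} n\, p_n = \frac{\lambda_p}{k(1-\theta)\mu_c - \lambda_p}$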


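while the expected response time of a task in this stream, $W_i$, follows from Little's law applied to the stream arrival rate $\lambda_p/k$ as $W_i = \frac{L_i}{\lambda_p/k} = \frac{k}{k(1-\theta)\mu_c - \lambda_p}$.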
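The feedback stream behaves like a single-server queue whose effective service rate is reduced by the return probability. The sketch below evaluates the occupancy and, using Little's law with the per-stream arrival rate $\lambda_p/k$ (an assumption about how the response time is obtained), the response time. The function name and argument names are illustrative.

```python
def feedback_stream(lam, mu, k, theta):
    """Occupancy, response time and state probabilities for one stream.

    lam: total arrival rate; mu: service rate; k: number of streams;
    theta: probability that a task loops back for another pass.
    """
    effective_mu = (1.0 - theta) * mu      # rate at which tasks actually depart
    rho = (lam / k) / effective_mu
    if not 0 <= rho < 1:
        raise ValueError("stream is unstable for these parameters")
    occupancy = rho / (1.0 - rho)
    response = occupancy / (lam / k)       # Little's law on the external arrivals

    def p(n):
        """Equilibrium probability of n tasks in the stream."""
        return rho ** n * (1.0 - rho)

    return occupancy, response, p
```

With `theta = 0` the results fall back to the plain split-stream values, so the feedback probability can be read as a direct penalty on the service rate.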


A  Multidepth  Pipeline 
Classifier  Processor 

Furthermore, one could extend this architecture to a multidepth pipeline organization, indicated below, in which at a certain depth either a feature vector is processed to the extent of knowing in which class it best belongs, or it is not, in which case the pattern vector is passed on to the next-level classifier working at a finer-grain resolution at the next depth. Assuming that, at each level, a feature vector is referred to the next-level processor with one probability and loops back for further processing at the same level with another, and assuming that, for all practical purposes, the system is decomposable into multiple M/M/c subsystems working in tandem, we have the effective pattern arrival rates in each subsystem, at equilibrium, as


-  C? 


in  terms  of  which,  the  expected  residence  time  of  a  pattern  task,  at  each  pipeline  section 
comes  out  to  be 


W 


i  - 


1 


Note that an infinite variety of parallel distributed architectures for neural-based network systems could be proposed. In almost all cases, multiple neuron-type processors are best suited to provide a SIMD-type architecture at some computational level. Organizing machines around SIMD-type primitives, as indicated above, yields substantial performance.

DISCRETE FOURIER TRANSFORMS ON HYPERCUBES


This paper presents a decomposition of the discrete Fourier transform matrix that is more explicit than what is found in the literature. It is shown that such a decomposition forms an important tool in mapping the transform computation onto a parallel architecture such as that provided by the hypercube topology. A basic scheme for power-of-a-prime length transform computations on the hypercube is discussed.

The use of the Chinese Remainder Theorem in pipelining the computation of the transforms, for lengths that are not a power of a prime, is also discussed.


1.  A DFT matrix decomposition

Recall that the discrete Fourier transform Y of length n of a vector X of length n is defined by

$Y = F X$

where F is the square matrix of order n given by

$F = [\, W^{kj} \,]$

with $0 \le k, j \le n-1$ and where W is the nth root of unity

$W = e^{-2\pi i/n} = \cos(2\pi/n) - i \sin(2\pi/n)$,

with $i^2 = -1$.
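The definition translates directly into a dense matrix-vector product. The sketch below is only a reference implementation of the definition, useful for checking faster schemes on small n; the function names are illustrative.

```python
import cmath


def dft_matrix(n):
    """Return F = [W^(k*j)] with W = exp(-2*pi*i/n), as a list of rows."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * j) for j in range(n)] for k in range(n)]


def dft(x):
    """Compute Y = F X directly from the definition (O(n^2) operations)."""
    n = len(x)
    f = dft_matrix(n)
    return [sum(f[k][j] * x[j] for j in range(n)) for k in range(n)]
```

For example, `dft([1, 0, 0, 0])` gives four values equal to 1, up to rounding, since the first column of F is all ones.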


It is claimed that if $n = 2^{\gamma}$, then the DFT matrix F is, up to a permutation matrix, the product of $\gamma = \log_2 n$ sparse matrices

$F \sim F_0\, F_1\, F_2 \cdots F_{\gamma-2}\, F_{\gamma-1}$

where each $F_{\gamma-i}$ is a block diagonal matrix

$F_{\gamma-i} = \mathrm{diag}(F_{\gamma-i,0},\ F_{\gamma-i,1},\ \ldots,\ F_{\gamma-i,2^{i-1}-1})$

and the following properties hold

(i) each block $F_{\gamma-i,\alpha}$ is square of order $2^{\gamma-i+1} = n/2^{i-1}$ and is of the form

$F_{\gamma-i,\alpha} = \begin{pmatrix} I & W^{r} I \\ I & W^{r+n/2} I \end{pmatrix}$

and there are exactly $2^{i-1}$ blocks; I is the identity matrix of order $n/2^{i}$;

(ii) for each block, r is the nonnegative integer whose $\gamma$-bit binary expansion is the reversal of that of $2\alpha$.


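The proof of this decomposition can be sketched as follows.

We write the components of Y and X as y[k] and x[j], for $0 \le k, j \le n-1$, so that

$y[k] = \sum_{j=0}^{n-1} x[j]\, W^{kj}$

Hence, with indices written in binary, we get

$y[k] = y[k_{\gamma-1} k_{\gamma-2} \cdots k_1 k_0]$

$x[j] = x[j_{\gamma-1} j_{\gamma-2} \cdots j_1 j_0]$

$W^{kj} = W^{(2^{\gamma-1}k_{\gamma-1} + \cdots + 2k_1 + k_0)(2^{\gamma-1}j_{\gamma-1} + 2^{\gamma-2}j_{\gamma-2} + \cdots + 2j_1 + j_0)}$

$y[k_{\gamma-1} k_{\gamma-2} \cdots k_1 k_0] = \sum_{j_0=0}^{1} \Big( \sum_{j_1=0}^{1} \Big( \cdots \Big( \sum_{j_{\gamma-2}=0}^{1} \Big( \sum_{j_{\gamma-1}=0}^{1} x[j_{\gamma-1} j_{\gamma-2} \cdots j_1 j_0]\, W_{\gamma-1} \Big) W_{\gamma-2} \Big) \cdots \Big) W_1 \Big) W_0$

where

$W_{\gamma-1} = W^{(2^{\gamma-1}k_{\gamma-1} + \cdots + 2k_1 + k_0)\, 2^{\gamma-1} j_{\gamma-1}}$

$W_{\gamma-2} = W^{(2^{\gamma-1}k_{\gamma-1} + \cdots + 2k_1 + k_0)\, 2^{\gamma-2} j_{\gamma-2}}$

$\vdots$

$W_1 = W^{(2^{\gamma-1}k_{\gamma-1} + \cdots + 2k_1 + k_0)\, 2 j_1}$

$W_0 = W^{(2^{\gamma-1}k_{\gamma-1} + \cdots + 2k_1 + k_0)\, j_0}$

Since $W^{n} = W^{2^{\gamma}} = 1$, we actually get

$W_{\gamma-1} = W^{k_0\, 2^{\gamma-1} j_{\gamma-1}}$

$W_{\gamma-2} = W^{(2k_1 + k_0)\, 2^{\gamma-2} j_{\gamma-2}}$

$\vdots$

$W_1 = W^{(2^{\gamma-2}k_{\gamma-2} + \cdots + 2k_1 + k_0)\, 2 j_1}$

$W_0 = W^{(2^{\gamma-1}k_{\gamma-1} + 2^{\gamma-2}k_{\gamma-2} + \cdots + 2k_1 + k_0)\, j_0}$

The computation can then be carried out in stages, where at each stage i = 1, ..., $\gamma$ we calculate an intermediate vector $X_{\gamma-i}$.

Note that in our notation, the first vector to be computed is $X_{\gamma-1}$ and the last is $X_0$.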


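From the expansion of

$y[k] = y[k_{\gamma-1} k_{\gamma-2} \cdots k_1 k_0]$

we see that the components of the vectors $X_{\gamma-i}$ and $X_{\gamma-i+1}$ are related by the equations

$X_{\gamma-i}[k_0 \cdots k_{i-3} k_{i-2} k_{i-1}\, j_{\gamma-i-1} j_{\gamma-i-2} \cdots j_1 j_0]$

$= \sum_{j_{\gamma-i}=0}^{1} X_{\gamma-i+1}[k_0 \cdots k_{i-3} k_{i-2}\, j_{\gamma-i}\, j_{\gamma-i-1} j_{\gamma-i-2} \cdots j_1 j_0]\; W^{2^{\gamma-i}(2^{i-1}k_{i-1} + 2^{i-2}k_{i-2} + \cdots + k_0)\, j_{\gamma-i}}$

$= X_{\gamma-i+1}[k_0 \cdots k_{i-3} k_{i-2}\; 0\; j_{\gamma-i-1} j_{\gamma-i-2} \cdots j_1 j_0]\; W^{2^{\gamma-i}(2^{i-1}k_{i-1} + 2^{i-2}k_{i-2} + \cdots + k_0) \cdot 0}$
$\;+\; X_{\gamma-i+1}[k_0 \cdots k_{i-3} k_{i-2}\; 1\; j_{\gamma-i-1} j_{\gamma-i-2} \cdots j_1 j_0]\; W^{2^{\gamma-i}(2^{i-1}k_{i-1} + 2^{i-2}k_{i-2} + \cdots + k_0) \cdot 1}$

$= X_{\gamma-i+1}[k_0 \cdots k_{i-3} k_{i-2}\; 0\; j_{\gamma-i-1} j_{\gamma-i-2} \cdots j_1 j_0]$
$\;+\; X_{\gamma-i+1}[k_0 \cdots k_{i-3} k_{i-2}\; 1\; j_{\gamma-i-1} j_{\gamma-i-2} \cdots j_1 j_0]\; W^{2^{\gamma-i}(2^{i-1}k_{i-1} + 2^{i-2}k_{i-2} + \cdots + k_0)}$

i.e. we may write

$X_{\gamma-i} = F_{\gamma-i}\, X_{\gamma-i+1}$

for some appropriate matrix $F_{\gamma-i}$.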

From the last expression above, we see that the matrix $F_{\gamma-i}$ can be blocked into $2^{i-1}$ submatrices. Each of the submatrices is determined by a unique value of the bits that are more significant than the (i-1)th bit, namely

$k_0\, k_1 \cdots k_{i-3}\, k_{i-2}$

We use these bits as our label $\alpha$ for each block, so the decimal value of $\alpha$ is

$2^{i-2}k_0 + 2^{i-3}k_1 + \cdots + 2k_{i-3} + k_{i-2}$

The size of each block $F_{\gamma-i,\alpha}$ is determined by the number of bits that are less significant than those of the label $\alpha$. We can see that there are $\gamma-i+1$ such bits, hence the size of $F_{\gamma-i,\alpha}$ is

$2^{\gamma-i+1} = 2^{\gamma}/2^{i-1} = n/2^{i-1}$


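We next need to show that $F_{\gamma-i,\alpha}$ is given by

$F_{\gamma-i,\alpha} = \begin{pmatrix} I & W^{r} I \\ I & W^{r+n/2} I \end{pmatrix}$

For a fixed $\alpha = 2^{i-2}k_0 + 2^{i-3}k_1 + \cdots + 2k_{i-3} + k_{i-2}$, the two possible values for $k_{i-1}$ determine the two halves of the matrix $F_{\gamma-i,\alpha}$.

Indeed when $k_{i-1} = 0$ then

$X_{\gamma-i}[k_0 \cdots k_{i-3} k_{i-2}\; 0\; j_{\gamma-i-1} \cdots j_1 j_0] = X_{\gamma-i+1}[k_0 \cdots k_{i-3} k_{i-2}\; 0\; j_{\gamma-i-1} \cdots j_1 j_0]$
$\;+\; X_{\gamma-i+1}[k_0 \cdots k_{i-3} k_{i-2}\; 1\; j_{\gamma-i-1} j_{\gamma-i-2} \cdots j_1 j_0]\; W^{2^{\gamma-i}(0 + 2^{i-2}k_{i-2} + \cdots + k_0)}$

Now the submatrices I and $W^{r} I$, where I is the identity matrix of order $n/2^{i}$, are evident, where r is given by

$r = 2^{\gamma-i}(2^{i-2}k_{i-2} + 2^{i-3}k_{i-3} + \cdots + 2k_1 + k_0)$

A similar analysis for $k_{i-1} = 1$, which adds $2^{\gamma-i} \cdot 2^{i-1} = n/2$ to the exponent, shows the two submatrices in the second half of $F_{\gamma-i,\alpha}$.

It remains to show that the $\gamma$-bit binary expansion of r is the reverse of that of $2\alpha$. Note that

$r = 2^{\gamma-2}k_{i-2} + \cdots + 2^{\gamma-i+1}k_1 + 2^{\gamma-i}k_0$

$\alpha = 2^{i-2}k_0 + 2^{i-3}k_1 + \cdots + 2k_{i-3} + k_{i-2}$

$2\alpha = 2^{i-1}k_0 + 2^{i-2}k_1 + \cdots + 2^{2}k_{i-3} + 2k_{i-2}$

The $\gamma$-bit binary expansion of $2\alpha$ is then

$2^{\gamma-1} \cdot 0 + 2^{\gamma-2} \cdot 0 + \cdots + 2^{i} \cdot 0 + 2^{i-1}k_0 + \cdots + 2^{2}k_{i-3} + 2^{1}k_{i-2} + 2^{0} \cdot 0$

Reversing the bits of $2\alpha$ amounts to making the following substitutions

$2^{1}k_{i-2} \rightarrow 2^{\gamma-2}k_{i-2}$

$2^{2}k_{i-3} \rightarrow 2^{\gamma-3}k_{i-3}$

$\vdots$

$2^{i-1}k_0 \rightarrow 2^{\gamma-1-(i-1)}k_0 = 2^{\gamma-i}k_0$

which yields the number

$2^{\gamma-2}k_{i-2} + 2^{\gamma-3}k_{i-3} + \cdots + 2^{\gamma-i}k_0$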

which is indeed r.

Note that because of the way the indices are ordered, at the end we get

$y[k_{\gamma-1} k_{\gamma-2} \cdots k_1 k_0] = X_0[k_0 k_1 \cdots k_{\gamma-2} k_{\gamma-1}]$

Hence

$F = R_{2^{\gamma}}\, F_0 F_1 F_2 \cdots F_{\gamma-1}$

where $R_{2^{\gamma}}$ is the permutation matrix corresponding to the permutation of the numbers

$0, 1, \ldots, n-1 = 2^{\gamma}-1$

defined by reversing the bits in the binary representation.
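The staged computation can be sketched as follows: stage i applies, to each block labeled $\alpha$, the butterfly with twiddle $W^{r}$, where r is the $\gamma$-bit reversal of $2\alpha$, and the final result is read out in bit-reversed order. This is an illustrative sketch of the decomposition for power-of-2 lengths; the function names are assumptions.

```python
import cmath


def bit_reverse(x, bits):
    """Reverse the lowest `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def fft_staged(x):
    """DFT of a power-of-2 length vector via gamma sparse stages."""
    n = len(x)
    gamma = n.bit_length() - 1
    if n < 1 or (1 << gamma) != n:
        raise ValueError("length must be a power of 2")
    a = [complex(v) for v in x]
    for i in range(1, gamma + 1):
        half = n >> i                        # order of the identity blocks
        for alpha in range(1 << (i - 1)):    # one block per label alpha
            r = bit_reverse(2 * alpha, gamma)
            w = cmath.exp(-2j * cmath.pi * r / n)
            base = alpha * 2 * half
            for j in range(half):
                u = a[base + j]
                v = a[base + half + j] * w
                a[base + j] = u + v          # row (I, W^r I)
                a[base + half + j] = u - v   # row (I, W^(r+n/2) I)
    # undo the bit reversal of the output indices
    return [a[bit_reverse(k, gamma)] for k in range(n)]
```

Each stage touches every component exactly once, so the total work is proportional to $n \log_2 n$ rather than $n^2$.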

2.  The  hypercube  topology 

The  decomposition  of  the  DFT  matrix  into  a  product  of  sparse  matrices,  as  shown  in 
section  1  provides  the  essential  tool  for  mapping  the  computation  of  the  DFT  on 
multiprocessor  systems.  In  this  section  we  examine  the  case  of  a  hypercube. 

The hypercube topology has now appeared in successful commercial multiprocessor systems, including Intel's iPSC, Ametek's S/14, Ncube's NCUBE systems and the Connection Machines from Thinking Machines. The hypercube has a simple recursive definition:

(a) a 0-cube, with $2^0 = 1$ node, is a uniprocessor;

(b) for d ≥ 1, a d-cube, with $2^d$ nodes, is obtained from two copies of a (d-1)-cube by connecting each pair of corresponding nodes with a communication link.

Note the very natural labeling of the nodes of the d-cube: simply precede the labels of the nodes from the first copy of the (d-1)-cube with a 0, and those from the second copy with a 1, and connect nodes that differ only in the first bit.
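The recursive definition can be followed literally to list the links of a d-cube. The sketch below represents node labels as integers whose binary digits are the label bits; the function name is illustrative.

```python
def hypercube_edges(d):
    """Return the set of links (u, v), u < v, of a d-cube built recursively."""
    if d == 0:
        return set()                      # a 0-cube is a single node, no links
    sub = hypercube_edges(d - 1)
    top = 1 << (d - 1)                    # the new leading bit
    edges = set(sub)                      # first copy: labels prefixed with 0
    edges |= {(u | top, v | top) for u, v in sub}   # second copy: prefixed with 1
    edges |= {(u, u | top) for u in range(top)}     # link corresponding nodes
    return edges
```

Every link produced this way joins two labels that differ in exactly one bit, and a d-cube ends up with $d \cdot 2^{d-1}$ links.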


[Figure: the 0-cube, 1-cube, 2-cube and 3-cube]


The hypercube is attractive for several reasons, including the following

(a) many of the classical topologies (rings, trees, meshes, etc.) can be embedded in hypercubes (hence previously designed algorithms may still be used);

(b) the layout is totally symmetric (by the way, the hypercube appears among proposed architectures designed from finite algebraic groups);

(c) the communication diameter is logarithmic in the number of processors;

(d) the topology exhibits a reasonable behavior in the presence of faulty processors or communication links; there are several paths connecting one node to another; in fact the number of disjoint paths of minimum distance from node A to node B is the Hamming distance between the (binary) labels of A and B;

(e) a hypercube is Hamiltonian, i.e. a ring with $2^d$ nodes can be embedded in a d-cube;

(f) many algorithms for diverse fields such as the physical sciences, signal processing, numerical analysis, operations research and graphics have been successfully designed for the hypercube.
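Properties (d) and (e) are easy to illustrate: the Hamming distance between two labels is a bit count, and a ring embedding is given by the reflected Gray code, a standard construction that is used here as an illustration and is not taken from the report. The function names are assumptions.

```python
def hamming(a, b):
    """Number of bit positions in which labels a and b differ."""
    return bin(a ^ b).count("1")


def gray_ring(d):
    """Visit all 2^d nodes of a d-cube so that consecutive nodes are neighbors."""
    return [i ^ (i >> 1) for i in range(1 << d)]


ring = gray_ring(3)   # a ring of 8 nodes embedded in a 3-cube
```

Consecutive entries of `ring`, including the last and the first, differ in exactly one bit, so each ring link maps onto a physical hypercube link.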

3.  Example of a transform computation mapping onto a hypercube

A decomposition of the DFT matrix into a product of sparse matrices is visualized, in the usual manner, by means of a signal flow graph. For the length $n = 2^3$, the decomposition of section 1 yields the following graph.

[Figure: signal flow graph with node columns $X$, $X_2$, $X_1$, $X_0$]

where the unscrambling of the index bits is not shown. Each column of nodes in the graph corresponds to an intermediate vector $X_{\gamma-i}$ of section 1.

Suppose now that a 2-cube is available. Looking at the signal flow graph, we can allocate the first 2 components (indexed 000 and 001) to processor 00, the next two components (indexed 010 and 011) to processor 01, etc. In general, two components indexed ab0 and ab1 are allocated to processor ab, as shown in the following figure.


We see that the computation requires 3 steps (if we ignore the bit reversing); during the first 2 steps, interprocessor communication is required; during the last step, only local data is used in each processor, so there is no message passing between the processors.

4.  The general case

The above example generalizes to the case of a length $n = 2^{\gamma}$ transform, for which a k-cube is available. Each processor is assigned a k-bit binary label. The result of section 1 allows us to allocate components of X and any intermediate vector $X_{\gamma-i}$ as follows: each processor will have $2^{\gamma}/2^{k} = 2^{\gamma-k}$ components at any stage; the processor labeled p is allocated the components with indices

$p \cdot 2^{\gamma-k},\ \ p \cdot 2^{\gamma-k} + 1,\ \ \ldots,\ \ p \cdot 2^{\gamma-k} + 2^{\gamma-k} - 1$

i.e. if the processor label is

$p = p_{k-1} p_{k-2} \cdots p_1 p_0$

then the processor works on the components whose indices are

$p_{k-1} p_{k-2} \cdots p_1 p_0\, 00 \cdots 00$
$p_{k-1} p_{k-2} \cdots p_1 p_0\, 00 \cdots 01$
$p_{k-1} p_{k-2} \cdots p_1 p_0\, 00 \cdots 10$
$\vdots$
$p_{k-1} p_{k-2} \cdots p_1 p_0\, 11 \cdots 11$

From the discussion in section 1 it should be clear that the computation can be decomposed into $\gamma$ stages. The first k stages require interprocessor communication and the last $\gamma - k$ stages use local data only. During stage i, where 1 ≤ i ≤ k, each processor labeled

$p = p_{k-1} p_{k-2} \cdots p_1 p_0$

communicates with and only with the processor whose label differs from p only in the ith bit from the left.
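The allocation and the communication pattern reduce to a few bit operations. The sketch below uses hypothetical function names, with `gamma` the number of index bits and `k` the cube dimension.

```python
def owned_indices(p, gamma, k):
    """Component indices held by processor p: the k leading index bits equal p."""
    size = 1 << (gamma - k)
    return range(p * size, (p + 1) * size)


def partner(p, i, k):
    """Processor that p exchanges data with during stage i (1 <= i <= k).

    The partner's label differs from p only in the ith bit from the left.
    """
    if not 1 <= i <= k:
        raise ValueError("only the first k stages need communication")
    return p ^ (1 << (k - i))
```

For the length-8 transform on a 2-cube, `list(owned_indices(0b01, 3, 2))` is `[2, 3]`, and processor 01 talks to processor 11 in stage 1 and to processor 00 in stage 2.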
5.  Decomposing with the Chinese Remainder Theorem

The scheme presented in the previous sections works for a length that is a power of 2, and should generalize easily to any power of a prime integer. The next most obvious question is how to decompose the computation if the length is not the power of a prime. In this case, a possible answer is provided by the prime factorization of n

$n = p_1^{\alpha_1}\, p_2^{\alpha_2} \cdots p_{k-1}^{\alpha_{k-1}}\, p_k^{\alpha_k}$

and the ring isomorphism provided by the so-called Chinese Remainder Theorem (CRT).


The CRT states that if the integers $n_1, \ldots, n_k$ are pairwise coprime, the ring $Z_n$ and
the product ring

$$Z_{n_1} \times Z_{n_2} \times \cdots \times Z_{n_k}$$

are isomorphic, where n is the product of the $n_i$, and the arithmetic in the rings is ordinary
modular arithmetic (in the product everything is computed componentwise).

The forward isomorphism

$$Z_n \rightarrow Z_{n_1} \times Z_{n_2} \times \cdots \times Z_{n_k}$$

is given by

$$x \mapsto (x \bmod n_1,\; x \bmod n_2,\; \ldots,\; x \bmod n_k).$$

The inverse mapping is given by

$$(a_1, a_2, \ldots, a_k) \mapsto x$$

where

$$x = \left[ a_1 u_1 (u_1^{-1} \bmod n_1) + a_2 u_2 (u_2^{-1} \bmod n_2) + \cdots + a_k u_k (u_k^{-1} \bmod n_k) \right] \bmod n,$$

with $u_i = n/n_i$ and $u_i^{-1}$ the inverse of $u_i$ modulo $n_i$ (which must exist since $u_i$ and $n_i$
are clearly coprime).

The essence of the isomorphism is that if $n = n_1 n_2 \cdots n_k$, where the $n_i$ are pairwise
coprime, then an integer between 0 and $n-1$ may be thought of as a k-tuple of integers, the first
between 0 and $n_1 - 1$, the second between 0 and $n_2 - 1$, ..., the last between 0 and $n_k - 1$.
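
The two mappings can be checked numerically with a short Python sketch. The function names `crt_forward` and `crt_inverse` are ours and the moduli in the usage lines are arbitrary examples; the code assumes Python 3.8 or later, where `pow(u, -1, m)` returns a modular inverse.

```python
from math import prod


def crt_forward(x, moduli):
    """Map x in Z_n to the tuple (x mod n_1, ..., x mod n_k)."""
    return tuple(x % ni for ni in moduli)


def crt_inverse(residues, moduli):
    """Map (a_1, ..., a_k) back to x in Z_n, n = n_1 * ... * n_k.

    Uses x = [sum_i a_i * u_i * (u_i^-1 mod n_i)] mod n with u_i = n / n_i.
    The moduli are assumed to be pairwise coprime.
    """
    n = prod(moduli)
    total = 0
    for ai, ni in zip(residues, moduli):
        ui = n // ni
        total += ai * ui * pow(ui, -1, ni)   # pow(..., -1, ni): inverse mod n_i
    return total % n


if __name__ == "__main__":
    moduli = (3, 4, 5)                        # pairwise coprime, n = 60
    n = prod(moduli)
    # Every integer 0..n-1 survives the round trip.
    assert all(crt_inverse(crt_forward(x, moduli), moduli) == x for x in range(n))
    # Products in Z_n correspond to componentwise products of the tuples.
    a, b = 17, 23
    lhs = crt_forward(a * b % n, moduli)
    rhs = tuple(p * q % ni for p, q, ni in
                zip(crt_forward(a, moduli), crt_forward(b, moduli), moduli))
    assert lhs == rhs
```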




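Now the relationship

$$y[a] = \sum_{b=0}^{n-1} x[b]\, W^{ab}$$

may be written

$$y(a_1, a_2, \ldots, a_k) = \sum_{b_1=0}^{n_1-1} \sum_{b_2=0}^{n_2-1} \cdots \sum_{b_k=0}^{n_k-1} x(b_1, b_2, \ldots, b_k)\, W^{(a_1, a_2, \ldots, a_k)(b_1, b_2, \ldots, b_k)}.$$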


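The product

$$(a_1, a_2, \ldots, a_k)(b_1, b_2, \ldots, b_k)$$

is computed in the product ring

$$Z_{n_1} \times Z_{n_2} \times \cdots \times Z_{n_k}$$

as follows:

$$(a_1, a_2, \ldots, a_k)(b_1, b_2, \ldots, b_k) = (a_1 b_1, a_2 b_2, \ldots, a_k b_k)$$
$$= a_1 b_1 (1,0,0,\ldots,0) + a_2 b_2 (0,1,0,\ldots,0) + \cdots + a_k b_k (0,0,0,\ldots,1),$$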

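hence if we let

$$W_i = W^{(0, \ldots, 0, 1, 0, \ldots, 0)}$$

where the 1 is at the ith position, then

$$y(a_1, a_2, \ldots, a_k) = \sum_{b_1=0}^{n_1-1} \sum_{b_2=0}^{n_2-1} \cdots \sum_{b_k=0}^{n_k-1} x(b_1, b_2, \ldots, b_k)\, W_1^{a_1 b_1} W_2^{a_2 b_2} \cdots W_k^{a_k b_k}.$$

Note that $W_i$ is an $n_i$th root of unity.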
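
A short justification of this remark may help. In the sketch below, $e_i$ is our own name (it does not appear in the original) for the integer of $Z_n$ that the inverse CRT mapping assigns to the unit tuple with a 1 in position i, and W is assumed to be an nth root of unity, as is usual for the DFT kernel.

```latex
e_i = u_i \,(u_i^{-1} \bmod n_i) \bmod n, \qquad W_i = W^{e_i},
\qquad
W_i^{\,n_i} = W^{\,n_i u_i (u_i^{-1} \bmod n_i)}
            = \left( W^{\,n} \right)^{(u_i^{-1} \bmod n_i)} = 1,
\quad \text{since } n_i u_i = n .
```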

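The computation of vector Y may thus be accomplished in k stages:

$$y(a_1, \ldots, a_k) = \sum_{b_1=0}^{n_1-1} \sum_{b_2=0}^{n_2-1} \cdots \sum_{b_k=0}^{n_k-1} x(b_1, b_2, \ldots, b_k)\, W_1^{a_1 b_1} W_2^{a_2 b_2} \cdots W_k^{a_k b_k}$$
$$= \sum_{b_1=0}^{n_1-1} W_1^{a_1 b_1} \left( \sum_{b_2=0}^{n_2-1} W_2^{a_2 b_2} \left( \cdots \left( \sum_{b_k=0}^{n_k-1} x(b_1, b_2, \ldots, b_k)\, W_k^{a_k b_k} \right) \cdots \right) \right)$$

STAGE 1

$$f^{(1)}(b_1, b_2, \ldots, b_{k-1}, a_k) = \sum_{b_k=0}^{n_k-1} x(b_1, b_2, \ldots, b_k)\, W_k^{a_k b_k}$$

STAGE 2

$$f^{(2)}(b_1, \ldots, b_{k-2}, a_{k-1}, a_k) = \sum_{b_{k-1}=0}^{n_{k-1}-1} f^{(1)}(b_1, \ldots, b_{k-1}, a_k)\, W_{k-1}^{a_{k-1} b_{k-1}}$$

STAGE i

$$f^{(i)}(b_1, \ldots, b_{k-i}, a_{k-i+1}, \ldots, a_k) = \sum_{b_{k-i+1}=0}^{n_{k-i+1}-1} f^{(i-1)}(b_1, \ldots, b_{k-i+1}, a_{k-i+2}, \ldots, a_k)\, W_{k-i+1}^{a_{k-i+1} b_{k-i+1}}$$

STAGE k

$$y(a_1, a_2, \ldots, a_{k-1}, a_k) = \sum_{b_1=0}^{n_1-1} f^{(k-1)}(b_1, a_2, \ldots, a_k)\, W_1^{a_1 b_1}$$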

Stage i requires the computation of $n/n_{k-i+1}$ DFTs, each of length $n_{k-i+1}$. These DFTs may
be computed in parallel once stage i-1 has been completely finished. Note that if $M_i$ (resp.
$A_i$) is the number of multiplications (resp. additions) required for the $n_i$-transform, then in
total the n-transform requires

$$n \sum_{i=1}^{k} \frac{M_i}{n_i} \text{ multiplications and } n \sum_{i=1}^{k} \frac{A_i}{n_i} \text{ additions.}$$
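
The staged computation can be checked against a direct DFT with the following Python sketch. It is an illustration under our own assumptions rather than the implementation studied in this report: the names `crt_dft` and `direct_dft` are hypothetical, the kernel is taken as W = exp(-2*pi*j/n) (the sign convention is our choice), and each short transform is evaluated naively instead of with a fast algorithm. Python 3.8 or later is assumed.

```python
import cmath
from math import prod


def direct_dft(x):
    """Reference DFT: y[a] = sum_b x[b] * W**(a*b), W = exp(-2*pi*j/n)."""
    n = len(x)
    W = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[b] * W ** (a * b) for b in range(n)) for a in range(n)]


def crt_dft(x, moduli):
    """DFT of length n = n_1 * ... * n_k (pairwise coprime) in k stages."""
    n = prod(moduli)
    assert len(x) == n
    W = cmath.exp(-2j * cmath.pi / n)
    # W_i = W**e_i, where e_i in Z_n corresponds to the unit tuple (0,..,1,..,0).
    roots = []
    for ni in moduli:
        ui = n // ni
        roots.append(W ** (ui * pow(ui, -1, ni) % n))
    # Re-index the input by CRT tuples (b_1, ..., b_k).
    f = {tuple(b % ni for ni in moduli): x[b] for b in range(n)}
    # Stage 1 transforms the last coordinate, stage k the first one.
    for axis in reversed(range(len(moduli))):
        ni, Wi = moduli[axis], roots[axis]
        g = {}
        for idx in f:
            a = idx[axis]
            # One output point of a length-n_i DFT along this coordinate.
            g[idx] = sum(f[idx[:axis] + (b,) + idx[axis + 1:]] * Wi ** (a * b)
                         for b in range(ni))
        f = g
    # Read the result back in natural order a = 0, ..., n-1.
    return [f[tuple(a % ni for ni in moduli)] for a in range(n)]


if __name__ == "__main__":
    x = [complex(v, (v * v) % 5) for v in range(12)]
    got, ref = crt_dft(x, [3, 4]), direct_dft(x)
    print(max(abs(g - r) for g, r in zip(got, ref)))   # should be near zero
```

Within one stage, the sums for different values of the untouched coordinates are independent of each other, which is what allows them to be computed in parallel.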

6.  Future  work 

We need to find decompositions of the DFT matrix that allow better mappings of the
computation onto the hypercube. The scheme discussed in sections 1 through 4, for
example, leaves the load between the processors unbalanced (some of the processors
perform multiplications while others do not).

The decomposition of section 5 suggests a combination of pipelining and multiprocessing
that needs to be formalized. It also calls for the (parallel) study of transforms of short
lengths.

Conclusions


As stated at the outset, there are several avenues that we are interested in pursuing
further. Of these, the use of algebraic transforms and associated coding techniques
in the design of neural-net based classifiers stands out as the area where we should
focus research efforts next. We want to come up with classifiers capable of real-time
response and having low error rates. As part of this effort we need to identify,
using coding-theoretic knowledge, transforms suitable for operating environments
that have different design requirements: large number of classes, large block
length, high fault tolerance. After establishing that the proposed family of
architectures is worth pursuing, we will next study how to make them more
flexible by considering their concatenation with learning architectures to allow for
the generation of new classes, without prohibitively increasing
the overall response time of the composite adaptive network. The last phase of the
proposed effort involves the computation of the error rates for the most promising
of the resulting classifiers. Complete simulations need to be carried out on realistic
classification problems.

Concurrent with this report we are submitting a proposal for a follow-up project
based on these considerations.

MULTILAYER TRANSFORM-BASED NEURAL NET CLASSIFIERS

1.  SUMMARY 

This proposal seeks support for the investigation of neural network classifiers
based on discrete transforms whose kernels can be identified with linear and nonlinear
algebraic binary codes. More precisely, the study will concentrate on the feasibility of
designing and implementing nonclassical classifiers that are multilayer neural nets,
where one or more layers transform points in the pattern space into points in some
appropriate "frequency" domain, and where other layers classify the latter points
through attributes of the transform used. Concatenation of the proposed networks with
exemplar classifiers (such as adaptive resonance theory classifiers) will also be
experimented with.


2.  PROJECT  OBJECTIVES 

The main objective of the project is to produce classifiers with low error rates and
real-time response. These two attributes are viewed by researchers as the most fundamental
characteristics of classifiers with regard to the application of pattern classification to
real-world problems such as speech processing, radar signal processing, vision and
robotics. This project will explore the possibility of achieving a low error rate by viewing
patterns in the "frequency" domain.

Mappings from pattern space to frequency domain that are continuous must be
devised, where continuous means that patterns that are close under the pattern space
metric will have their images close under the Hamming distance. The frequency
domain will then appear as a code space and pattern classification might be thought of
as error correction in this space.

In the first phase of the project, the codewords will be "hard-wired" among the
interconnections of a neural network. The lower layers of this net will implement the
foregoing continuous mapping. They transform pattern vectors into noisy versions of
the codewords corresponding to the class labels. The upper layers of the net will use
coding-theoretic properties in code space to correct errors in the corrupted codewords,
thus identifying the class labels. Because of the hardwiring, the training time and the
operational response time should be relatively short, the latter corresponding only
to the propagation delay among the layers of the neural net. Since it is not expected
that a single transform (i.e., algebraic code) will accommodate all classification
situations, the identification of suitable transforms and their matching with different
operating environments (large number of classes, high fault-tolerance, high pattern
vector length, etc.) will also be studied. The large body of knowledge about algebraic
codes and decoding procedures will be put to use. It should be noted that many
transforms used in signal processing do arise from appropriate algebraic codes; for
example, the Walsh-Hadamard transform uses the celebrated Hadamard code.
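
The view of classification as error correction can be illustrated with a small Python sketch. It is not the proposed network: it only shows the decoding step that the upper layers would perform, under our own assumptions that the class codewords are the rows of a Sylvester-type Hadamard matrix written in 0/1 form and that decoding is a brute-force nearest-codeword search in Hamming distance. The function names are hypothetical.

```python
def hadamard_codewords(m):
    """Rows of the 2**m x 2**m Sylvester-Hadamard matrix as 0/1 codewords.

    Entry (r, c) is the parity of the bitwise AND of r and c; any two
    distinct rows differ in exactly 2**(m-1) positions.
    """
    n = 1 << m
    return [[bin(r & c).count("1") % 2 for c in range(n)] for r in range(n)]


def hamming(u, v):
    """Number of positions in which the binary vectors u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def classify(noisy, codewords):
    """Class label whose codeword is nearest to the noisy vector."""
    return min(range(len(codewords)),
               key=lambda label: hamming(noisy, codewords[label]))


if __name__ == "__main__":
    code = hadamard_codewords(3)          # 8 classes, codewords of length 8
    corrupted = list(code[5])
    corrupted[2] ^= 1                     # one bit flipped by the lower layers
    print(classify(corrupted, code))      # nearest codeword is still class 5
```

With codewords of length 8 the minimum distance is 4, so any single-bit error in a codeword is corrected by this search.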

The next phase of the project will investigate the possibility of introducing more
flexibility into the above architectures. Clearly, for this architecture the class labels
must be known beforehand, the choice of codewords that identify the labels must be
built into the net, and learning from examples is nonexistent. By concatenating the
proposed architecture with learning architectures, a fine-tuning of the parameters of
the classifier may be performed. This approach can yield a high payoff in terms of error
rates, but of course the training time and operating response time might significantly
increase.


The  last  phase  of  the  project  will  address  the  computation  of  the  error  rate  for  the 
proposed  classifier  architecture  in  order  to  evaluate  its  place  among  proposed  and 
implemented  classifiers.  A  difficulty  that  arises  here  is  that  the  continuous  mapping 
used  for  going  from  pattern  space  to  code  space  may  introduce  additional  errors 
(noise). 

Throughout  the  project,  simulations  will  be  conducted.  Realistic  examples  of 
classification  by  neural  nets  are  well  known  to  require  a  large  amount  of  memory  and 
interconnect  computation. 

It will be interesting to find out the tradeoffs that can be achieved through the
proposed architecture.

MISSION


ROME  LABORATORY 

Rome Laboratory plans and executes an interdisciplinary program in research,
development, test, and technology transition in support of Air
Force Command, Control, Communications and Intelligence (C3I) activities
for all Air Force platforms. It also executes selected acquisition programs
in several areas of expertise. Technical and engineering support within
areas of competence is provided to ESD Program Offices (POs) and other
ESD elements to perform effective acquisition of C3I systems. In addition,
Rome Laboratory's technology supports other AFSC Product Divisions, the
Air Force user community, and other DOD and non-DOD agencies. Rome
Laboratory maintains technical competence and research programs in areas
including, but not limited to, communications, command and control, battle
management, intelligence information processing, computational sciences
and software producibility, wide area surveillance/sensors, signal processing,
solid state sciences, photonics, electromagnetic technology,
superconductivity, and electronic reliability/maintainability and testability.